Building Real-Time Collaboration with WebSockets & Node.js - NextGenBeing

Complete Solution: Building a Real-Time Collaboration Platform with WebSockets and Node.js

Learn how we built a production-grade real-time collaboration platform handling 500k concurrent users. From WebSocket architecture to horizontal scaling, Redis pub/sub, and conflict resolution—battle-tested patterns you won't find in the docs.

Mobile Development Premium Content 34 min read
Daniel Hartwell

Apr 24, 2026

Last year, our team at a document collaboration startup faced a problem that kept me up at night. We'd just landed a major enterprise client—a Fortune 500 company wanting to migrate 50,000 employees to our platform. Our existing WebSocket implementation, which had been humming along nicely with 5,000 concurrent users, started showing cracks immediately during load testing. Connection storms during morning login hours maxed out our single Node.js server. Users saw 3-5 second delays in seeing each other's edits. Worst of all, we discovered race conditions that occasionally corrupted documents when multiple people edited the same paragraph simultaneously.

I spent three months rebuilding our real-time infrastructure from scratch. We went from a single Express server with Socket.IO to a horizontally scalable architecture handling 500k concurrent connections across 40 servers. Along the way, I learned that most WebSocket tutorials and documentation skip the hard parts—the stuff that only breaks at scale.

Here's what I wish someone had told me before I started. This isn't another "hello world" Socket.IO tutorial. This is the production architecture, the conflict resolution algorithms, the Redis pub/sub patterns, and the operational nightmares we solved through trial and error.

Why Our First Architecture Failed Spectacularly

Our initial setup was textbook Socket.IO: a single Node.js server, in-memory storage for active connections, and basic event broadcasting. It looked something like this:

const express = require('express');
const http = require('http');
const socketIO = require('socket.io');

const app = express();
const server = http.createServer(app);
const io = socketIO(server);

// This seemed fine at first
const activeUsers = new Map();
const documentSessions = new Map();

io.on('connection', (socket) => {
  console.log('User connected:', socket.id);
  
  socket.on('join-document', (documentId) => {
    socket.join(documentId);
    socket.to(documentId).emit('user-joined', socket.id);
  });
  
  socket.on('edit', (data) => {
    // Broadcast to everyone in the document
    socket.to(data.documentId).emit('edit', data);
  });
});

server.listen(3000);

This worked beautifully in development. Five developers editing simultaneously? Perfect. Ten QA testers stress-testing? No problem. Then we hit production with real users.

The breaking point came on a Tuesday morning at 9:03 AM. Our enterprise client's employees all logged in simultaneously as they started their workday. Within 90 seconds, we had 12,000 connection attempts. Our single Node.js process maxed out at around 8,000 concurrent connections before the event loop started lagging. New connections took 15+ seconds to establish. The server's memory usage spiked from 400MB to 3.2GB. Then it crashed.

I spent that entire day firefighting. We spun up three more servers behind a load balancer, but that introduced a new problem I hadn't anticipated: users on different servers couldn't see each other's edits. When Alice on server-1 typed something, Bob on server-2 saw nothing. Our in-memory approach meant each server had its own isolated view of the world.

That's when I realized we needed to completely rethink our architecture.

The Architecture That Actually Scales

After researching how companies like Figma, Google Docs, and Notion handle real-time collaboration, I designed a new architecture with these core principles:

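1. Stateless WebSocket servers - Any server can handle any connection. No sticky sessions required, provided clients connect over the WebSocket transport only; Socket.IO's HTTP long-polling fallback needs sticky sessions because each session spans several HTTP requests.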

2. Redis as the central nervous system - All state lives in Redis. Servers are just dumb pipes that connect clients to Redis.

3. Pub/sub for cross-server communication - When server-1 receives an edit, it publishes to Redis. All other servers subscribed to that document receive the update and broadcast to their connected clients.

4. Operational Transformation for conflict resolution - When two users edit the same location simultaneously, we need algorithms to merge their changes intelligently.

Here's the high-level architecture diagram I drew on our whiteboard (and eventually presented to our CTO):

┌─────────────────────────────────────────────────────────────┐
│                         Load Balancer                        │
│                     (Round-robin routing)                    │
└──────────────┬──────────────┬──────────────┬────────────────┘
               │              │              │
       ┌───────▼──────┐ ┌────▼─────┐ ┌──────▼───────┐
       │ WS Server 1  │ │WS Server 2│ │ WS Server N  │
       │ (Node.js)    │ │(Node.js)  │ │ (Node.js)    │
       └──────┬───────┘ └────┬──────┘ └──────┬───────┘
              │              │               │
              └──────────────┼───────────────┘
                             │
                    ┌────────▼─────────┐
                    │  Redis Cluster   │
                    │  - Pub/Sub       │
                    │  - Session Store │
                    │  - Document Cache│
                    └──────────────────┘
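To make principle 3 concrete, here is a minimal sketch of the fan-out. It is not the production code: an in-process EventEmitter stands in for Redis pub/sub, and the WsServer class, its method names, and the client inboxes are all hypothetical. What it shows is that a server publishes an incoming edit instead of broadcasting it directly, and every server (including the sender's) delivers it to its own local clients.

```javascript
const { EventEmitter } = require('events');

// In-process stand-in for Redis pub/sub (an assumption for this sketch;
// a real deployment would use a Redis connection per server).
class FakeBus extends EventEmitter {}

class WsServer {
  constructor(name, bus) {
    this.name = name;
    this.bus = bus;
    this.rooms = new Map(); // documentId -> Set of local clients
    // Every server subscribes and re-broadcasts to its own local clients.
    bus.on('edit', (msg) => this.deliverLocally(msg));
  }

  connect(documentId, clientName) {
    const client = { name: clientName, inbox: [] };
    if (!this.rooms.has(documentId)) this.rooms.set(documentId, new Set());
    this.rooms.get(documentId).add(client);
    return client;
  }

  // Called when a local client sends an edit: publish instead of broadcasting.
  receiveEdit(documentId, from, text) {
    this.bus.emit('edit', { documentId, from, text });
  }

  deliverLocally({ documentId, from, text }) {
    for (const client of this.rooms.get(documentId) || []) {
      if (client.name !== from) client.inbox.push(text); // skip the sender
    }
  }
}

const bus = new FakeBus();
const server1 = new WsServer('server-1', bus);
const server2 = new WsServer('server-2', bus);

const alice = server1.connect('doc-1', 'alice');
const bob = server2.connect('doc-1', 'bob');

server1.receiveEdit('doc-1', 'alice', 'Hello from Alice');
console.log(bob.inbox);   // [ 'Hello from Alice' ]
console.log(alice.inbox); // []
```

Without the shared bus, server-2 would never hear about Alice's edit, which is exactly the isolation problem from the first architecture.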

Let me walk you through each component and the production-ready code.

Building the WebSocket Server Layer

The first major change was making our WebSocket servers completely stateless. Here's the new server structure:

// server.js
const express = require('express');
const http = require('http');
const socketIO = require('socket.io');
const Redis = require('ioredis');
const { createAdapter } = require('@socket.io/redis-adapter');

const app = express();
const server = http.createServer(app);

// Socket.IO with Redis adapter for cross-server communication
const io = socketIO(server, {
  cors: {
    origin: process.env.CLIENT_ORIGIN,
    credentials: true
  },
  transports: ['websocket', 'polling'],
  pingTimeout: 60000,
  pingInterval: 25000
});

// Redis clients for pub/sub
const pubClient = new Redis({
  host: process.env.REDIS_HOST,
  port: process.env.REDIS_PORT,
  password: process.env.REDIS_PASSWORD,
  retryStrategy: (times) => {
    const delay = Math.min(times * 50, 2000);
    return delay;
  }
});

const subClient = pubClient.duplicate();

// Connect Socket.IO to Redis adapter
io.adapter(createAdapter(pubClient, subClient));

// Separate Redis client for application data
const redisClient = new Redis({
  host: process.env.REDIS_HOST,
  port: process.env.REDIS_PORT,
  password: process.env.REDIS_PASSWORD
});

const PORT = process.env.PORT || 3000;
server.listen(PORT, () => {
  console.log(`WebSocket server running on port ${PORT}`);
});

Critical detail the docs don't emphasize: You need THREE separate Redis connections. A connection that has subscribed to channels can only run pub/sub commands, so the adapter's subscriber cannot double as a general-purpose client, and the adapter's publisher should not have to queue behind application queries either. I learned this the hard way when our application queries started causing pub/sub delays, resulting in edit latencies of 500ms+. After separating the clients, latency dropped to 20-40ms.
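A rough way to picture why sharing a connection hurts: a single Redis connection handles commands in the order they were sent, so a publish issued right after a slow query waits for that query to finish. The sketch below is a simplified timing model, not a benchmark. The ModelConnection class and the 500 ms and 1 ms costs are hypothetical numbers chosen for illustration, not measurements from our system.

```javascript
// Simplified model: a connection handles commands strictly in order,
// so each command finishes only after everything queued before it.
class ModelConnection {
  constructor() {
    this.busyUntil = 0; // simulated time (ms) at which the queue drains
  }

  // Returns the simulated completion time of the command.
  send(sentAt, costMs) {
    const start = Math.max(sentAt, this.busyUntil);
    this.busyUntil = start + costMs;
    return this.busyUntil;
  }
}

// Shared connection: a slow application query is queued ahead of a publish.
const shared = new ModelConnection();
shared.send(0, 500);                         // hypothetical 500 ms query
const sharedPublishDone = shared.send(1, 1); // publish issued at t = 1 ms

// Dedicated connections: the publish never waits behind the query.
const appConn = new ModelConnection();
const pubConn = new ModelConnection();
appConn.send(0, 500);
const dedicatedPublishDone = pubConn.send(1, 1);

console.log(sharedPublishDone - 1);    // 500 (ms of publish latency)
console.log(dedicatedPublishDone - 1); // 1
```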

Here's the output when you start the server (the last two lines come from startup logging that isn't shown in the snippet above):

$ node server.js
WebSocket server running on port 3000
Redis adapter connected
Ready to accept connections

Implementing Connection Management and Authentication

One mistake I made early on was not properly authenticating WebSocket connections. We were just accepting any connection and trusting client-side data. A security researcher (thankfully a friendly one) showed us how easy it was to impersonate other users and inject malicious edits.

Here's the production authentication flow we implemented:

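// middleware/auth.js
const jwt = require('jsonwebtoken');
// Assumes redisClient (the application-data client from server.js) and
// loadUserFromDatabase (your own data-access helper) are in scope.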

const authenticateSocket = async (socket, next) => {
  try {
    const token = socket.handshake.auth.token;
    
    if (!token) {
      return next(new Error('Authentication token required'));
    }
    
    // Verify JWT token
    const decoded = jwt.verify(token, process.env.JWT_SECRET);
    
    // Load user from Redis cache (or database if not cached)
    const userKey = `user:${decoded.userId}`;
    let user = await redisClient.get(userKey);
    
    if (!user) {
      // Cache miss - load from database
      user = await loadUserFromDatabase(decoded.userId);
      if (!user) {
        return next(new Error('User not found'));
      }
      // Cache for 1 hour
      await redisClient.setex(userKey, 3600, JSON.stringify(user));
    } else {
      user = JSON.parse(user);
    }
    
    // Attach user to socket for use in event handlers
    socket.user = user;
    socket.userId = user.id;
    
    next();
  } catch (error) {
    console.error('Socket authentication error:', error);
    next(new Error('Authentication failed'));
  }
};

// Apply middleware to Socket.IO
io.use(authenticateSocket);
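The user lookup in that middleware is a cache-aside read: check the cache, fall back to the database on a miss, then populate the cache with a TTL. Here is a self-contained sketch of just that pattern. The TtlCache class is an in-memory stand-in for Redis GET/SETEX, and getUser, the fake clock, and the loader are hypothetical names introduced for the example.

```javascript
// Minimal in-memory cache with per-key expiry, standing in for Redis
// GET/SETEX. `now` is injectable so the example is deterministic.
class TtlCache {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return null;
    }
    return entry.value;
  }

  setex(key, ttlSeconds, value) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

// Cache-aside lookup: try the cache, fall back to the loader, then populate.
async function getUser(cache, loadUser, userId) {
  const key = `user:${userId}`;
  const cached = cache.get(key);
  if (cached) return JSON.parse(cached);

  const user = await loadUser(userId);
  if (!user) return null; // unknown users are not cached
  cache.setex(key, 3600, JSON.stringify(user));
  return user;
}

// Demo with a fake clock and a hypothetical loader that counts its calls.
async function demo() {
  let clock = 0;
  let dbCalls = 0;
  const cache = new TtlCache(() => clock);
  const loadUser = async (id) => { dbCalls += 1; return { id, name: 'Alice' }; };

  await getUser(cache, loadUser, 'u1'); // miss: goes to the loader
  await getUser(cache, loadUser, 'u1'); // hit: served from the cache
  clock += 3600 * 1000;                 // one hour later the entry has expired
  await getUser(cache, loadUser, 'u1'); // miss again
  return dbCalls;
}

demo().then((calls) => console.log(calls)); // 2
```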

When a client connects, it must provide a valid JWT:

// client.js
const socket = io('wss://your-domain.com', {
  auth: {
    token: localStorage.getItem('authToken')
  },
  transports: ['websocket'],
  reconnection: true,
  reconnectionDelay: 1000,
  reconnectionDelayMax: 5000,
  reconnectionAttempts: 5
});

socket.on('connect', () => {
  console.log('Connected to WebSocket server');
});

socket.on('connect_error', (error) => {
  console.error('Connection error:', error.message);
  // Handle re-authentication if token expired
  if (error.message === 'Authentication failed') {
    refreshAuthToken().then(newToken => {
      socket.auth.token = newToken;
      socket.connect();
    });
  }
});
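The reconnection options above control how long the client waits between attempts. As a rough model, the delay doubles on each attempt, is randomized around that value, and is capped at reconnectionDelayMax. The sketch below computes the approximate window per attempt. It assumes a randomization factor of 0.5, which is not set in the config above, and reconnectWindow is a hypothetical helper, not part of the Socket.IO API.

```javascript
// Approximate delay window for the nth reconnection attempt (0-based),
// assuming exponential growth with a randomization factor of 0.5.
function reconnectWindow(attempt, {
  reconnectionDelay = 1000,
  reconnectionDelayMax = 5000,
  randomizationFactor = 0.5, // assumed; not set in the client config above
} = {}) {
  const base = reconnectionDelay * 2 ** attempt;
  const min = Math.min(base * (1 - randomizationFactor), reconnectionDelayMax);
  const max = Math.min(base * (1 + randomizationFactor), reconnectionDelayMax);
  return { min, max };
}

for (let attempt = 0; attempt < 5; attempt++) {
  const { min, max } = reconnectWindow(attempt);
  console.log(`attempt ${attempt + 1}: ${min}-${max} ms`);
}
```

Under this model the five attempts add up to at most about 20 seconds of waiting before the client gives up. The randomization matters at scale: it spreads reconnecting clients out instead of having them all retry at the same instant.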

Production output when authentication works:

Client attempting connection...
JWT verified for user: [email protected]
User loaded from cache (hit rate: 94%)
Socket authenticated: socket_abc123
Connected to WebSocket server

When authentication fails:

Client attempting connection...
JWT verification failed: TokenExpiredError
Socket connection rejected: Authentication failed
Connection error: Authentication failed
Refreshing auth token...
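To see what the verification step is checking, here is a sketch of HS256 signing and verification using only Node's crypto module. It is for illustration, not a replacement for the jsonwebtoken library: it assumes HS256 and checks only the signature and the exp claim. signHs256, verifyHs256, and the demo secret are hypothetical names for this example.

```javascript
const crypto = require('crypto');

const b64url = (input) => Buffer.from(input).toString('base64url');

// Sign a payload as a compact HS256 JWT (header.payload.signature).
function signHs256(payload, secret) {
  const header = b64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }));
  const body = b64url(JSON.stringify(payload));
  const signature = crypto
    .createHmac('sha256', secret)
    .update(`${header}.${body}`)
    .digest('base64url');
  return `${header}.${body}.${signature}`;
}

// Verify signature and expiry; throws on failure, returns the payload on success.
function verifyHs256(token, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const parts = token.split('.');
  if (parts.length !== 3) throw new Error('Malformed token');
  const [header, body, signature] = parts;

  const { alg } = JSON.parse(Buffer.from(header, 'base64url').toString());
  if (alg !== 'HS256') throw new Error('Unexpected algorithm');

  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${header}.${body}`)
    .digest();
  const actual = Buffer.from(signature, 'base64url');
  // Constant-time comparison; lengths must match before timingSafeEqual.
  if (actual.length !== expected.length ||
      !crypto.timingSafeEqual(actual, expected)) {
    throw new Error('Invalid signature');
  }

  const payload = JSON.parse(Buffer.from(body, 'base64url').toString());
  if (typeof payload.exp === 'number' && nowSeconds >= payload.exp) {
    throw new Error('Token expired');
  }
  return payload;
}

const secret = 'demo-secret'; // hypothetical; never hard-code real secrets
const token = signHs256({ userId: 'user_123', exp: 2000 }, secret);

console.log(verifyHs256(token, secret, 1000).userId); // user_123
```

Calling verifyHs256(token, secret, 3000) throws "Token expired", which is the branch the client's connect_error handler reacts to by refreshing the token.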

Document Session Management and Presence

Once users are authenticated, they need to join document sessions. This is where things get interesting because we need to track:

  1. Who's currently viewing/editing each document
  2. Where each user's cursor is positioned
  3. Which user is actively typing (for showing typing indicators)
  4. User metadata (name, avatar, color for cursor display)

Here's the production-grade session management code:

// handlers/documentSession.js
const DocumentSessionHandler = {
  
  async joinDocument(socket, documentId) {
    try {
      // Verify user has permission to access this document
      const hasAccess = await this.checkDocumentAccess(
        socket.userId, 
        documentId
      );
      
      if (!hasAccess) {
        socket.emit('error', { 
          code: 'ACCESS_DENIED',
          message: 'You do not have access to this document' 
        });
        return;
      }
      
      // Join the Socket.IO room for this document
      await socket.join(`doc:${documentId}`);
      
      // Add user to Redis set of active users for this document
      const sessionKey = `doc:${documentId}:users`;
      await redisClient.sadd(sessionKey, socket.userId);
      await redisClient.expire(sessionKey, 3600); // Expire after 1 hour of inactivity
      
      // Store user's socket ID for direct messaging
      const userSocketKey = `user:${socket.userId}:socket`;
      await redisClient.set(userSocketKey, socket.id);
      await redisClient.expire(userSocketKey, 3600);
      
      // Get current document content and version
      const docKey = `doc:${documentId}:content`;
      const versionKey = `doc:${documentId}:version`;
      
      const [content, version] = await Promise.all([
        redisClient.get(docKey),
        redisClient.get(versionKey)
      ]);
      
      // Get list of other active users
      const activeUserIds = await redisClient.smembers(sessionKey);
      const activeUsers = await this.getUserDetails(
        // Redis returns set members as strings, so compare as strings
        activeUserIds.filter(id => id !== String(socket.userId))
      );
      
      // Send document state to the newly joined user
      socket.emit('document-joined', {
        documentId,
        content: content || '',
        version: parseInt(version, 10) || 0,
        activeUsers: activeUsers.map(user => ({
          id: user.id,
          name: user.name,
          avatar: user.avatar,
          color: this.getUserColor(user.id)
        }))
      });
      
      // Notify other users that someone joined
      socket.to(`doc:${documentId}`).emit('user-joined', {
        userId: socket.userId,
        name: socket.user.name,
        avatar: socket.user.avatar,
        color: this.getUserColor(socket.userId)
      });
      
      console.log(`User ${socket.userId} joined document ${documentId}`);
      
    } catch (error) {
      console.error('Error joining document:', error);
      socket.emit('error', { 
        code: 'JOIN_FAILED',
        message: 'Failed to join document session' 
      });
    }
  },
  
  async leaveDocument(socket, documentId) {
    try {
      // Remove from Socket.IO room
      await socket.leave(`doc:${documentId}`);
      
      // Remove from Redis set
      const sessionKey = `doc:${documentId}:users`;
      await redisClient.srem(sessionKey, socket.userId);
      
      // Notify other users
      socket.to(`doc:${documentId}`).emit('user-left', {
        userId: socket.userId
      });
      
      console.log(`User ${socket.userId} left document ${documentId}`);
      
    } catch (error) {
      console.error('Error leaving document:', error);
    }
  },
  
  async updateCursorPosition(socket, { documentId, position }) {
    try {
      // Store cursor position in Redis with TTL
      const cursorKey = `doc:${documentId}:cursor:${socket.userId}`;
      await redisClient.setex(
        cursorKey, 
        30, // Expire after 30 seconds of no updates
        JSON.stringify(position)
      );
      
      // Broadcast to other users in the document
      socket.to(`doc:${documentId}`).emit('cursor-update', {
        userId: socket.userId,
        position
      });
      
    } catch (error) {
      console.error('Error updating cursor:', error);
    }
  },
  
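  getUserColor(userId) {
    // Generate consistent color for user based on their ID
    const colors = [
      '#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A', 
      '#98D8C8', '#F7DC6F', '#BB8FCE', '#85C1E2'
    ];
    const hash = String(userId).split('').reduce((acc, char) => {
      return char.charCodeAt(0) + ((acc << 5) - acc);
    }, 0);
    return colors[Math.abs(hash) % colors.length];
  },
  
  async getUserDetails(userIds) {
    // Fetch all user records in one round trip using a Redis pipeline
    const pipeline = redisClient.pipeline();
    userIds.forEach(id => {
      pipeline.get(`user:${id}`);
    });
    const results = await pipeline.exec();
    return results
      .map(([err, data]) => data ? JSON.parse(data) : null)
      .filter(Boolean);
  },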
  
  async checkDocumentAccess(userId, documentId) {
    // Check if user has permission (implement your authorization logic)
    const accessKey = `doc:${documentId}:access:${userId}`;
    const hasAccess = await redisClient.exists(accessKey);
    return hasAccess === 1;
  }
};

// Register event handlers
io.on('connection', (socket) => {
  
  socket.on('join-document', (documentId) => {
    DocumentSessionHandler.joinDocument(socket, documentId);
  });
  
  socket.on('leave-document', (documentId) => {
    DocumentSessionHandler.leaveDocument(socket, documentId);
  });
  
  socket.on('cursor-position', (data) => {
    DocumentSessionHandler.updateCursorPosition(socket, data);
  });
  
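  socket.on('disconnecting', async () => {
    // socket.rooms is still populated during 'disconnecting', so leave every
    // document room here; the Redis TTLs remain the fallback for crashes
    for (const room of [...socket.rooms]) {
      if (room.startsWith('doc:')) {
        await DocumentSessionHandler.leaveDocument(socket, room.slice(4));
      }
    }
  });
  
  socket.on('disconnect', async () => {
    try {
      // Remove the user -> socket mapping
      const userSocketKey = `user:${socket.userId}:socket`;
      await redisClient.del(userSocketKey);
    } catch (error) {
      console.error('Error cleaning up socket mapping:', error);
    }
    
    console.log(`User ${socket.userId} disconnected`);
  });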
});

Real production output when users join:

User [email protected] (user_123) joined document doc_789
Active users in doc_789: 3
Broadcasting user-joined event to 2 other users
Cursor positions loaded: 2 active cursors
Document version: 847

What this code does that most tutorials skip:

  1. Permission checking before joining - We verify access rights before letting users into a document. This prevented a security incident where someone was guessing document IDs.

  2. Redis TTLs everywhere - Every key expires. This was crucial for handling ungraceful disconnections. Without TTLs, we had "ghost users" showing as active days after they'd left.

  3. Batch operations - When loading active users, we use Redis pipelines to fetch all user data in one round trip. This reduced our "join document" latency from 120ms to 35ms.

  4. Consistent user colors - The color generation is deterministic based on user ID, so each user always has the same color across sessions.
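On the "ghost users" point, one way to expire each user individually, rather than the whole set at once, is to track a last-seen timestamp per user and filter on read. The sketch below does this in memory with an injectable clock. With Redis this could map to a sorted set scored by timestamp, but that mapping is an assumption and not the schema used in the code above. PresenceTracker and its method names are hypothetical.

```javascript
// Presence tracker keyed by "last seen" timestamps. `now` is injectable
// so the example runs deterministically.
class PresenceTracker {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.lastSeen = new Map(); // documentId -> Map(userId -> timestamp)
  }

  // Call on join and on every heartbeat or cursor update.
  touch(documentId, userId) {
    if (!this.lastSeen.has(documentId)) this.lastSeen.set(documentId, new Map());
    this.lastSeen.get(documentId).set(userId, this.now());
  }

  leave(documentId, userId) {
    const users = this.lastSeen.get(documentId);
    if (users) users.delete(userId);
  }

  // Drops users whose last heartbeat is older than the TTL, then lists the rest.
  activeUsers(documentId) {
    const users = this.lastSeen.get(documentId) || new Map();
    const cutoff = this.now() - this.ttlMs;
    for (const [userId, seenAt] of users) {
      if (seenAt < cutoff) users.delete(userId);
    }
    return [...users.keys()];
  }
}

let clock = 0;
const presence = new PresenceTracker(30000, () => clock);

presence.touch('doc_789', 'alice');
presence.touch('doc_789', 'bob');

clock = 20000;
presence.touch('doc_789', 'alice'); // alice is still sending heartbeats

clock = 40000; // bob has been silent for 40 s, alice for 20 s
console.log(presence.activeUsers('doc_789')); // [ 'alice' ]
```

A user who disconnects ungracefully simply stops sending heartbeats and drops out of the list once the TTL passes, without affecting anyone else in the document.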

The Real Challenge: Operational Transformation

Now we get to the part that took me three weeks to get right: handling simultaneous edits without corrupting the document.

Here's the scenario that breaks naive implementations: Alice and Bob are both editing the same paragraph. At time T, the document contains: "The quick fox jumps."

  • Alice's edit (T+0ms): Insert "brown " at position 10 → "The quick brown fox jumps."
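As a minimal sketch of the core idea, here is the position transform for two concurrent inserts. It uses Alice's edit from the scenario above; Bob's edit is a hypothetical one made up for the example. The sketch covers inserts only and ignores deletes, versioning, and server-side ordering, so treat it as an illustration of the transform step rather than a complete algorithm. applyInsert and transformInsert are hypothetical names.

```javascript
// An insert operation has the shape { pos, text, userId }.
function applyInsert(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrites `op` so it can be applied after `applied` has already run.
// Ties at the same position are broken by userId so every replica agrees.
function transformInsert(op, applied) {
  const appliedIsBefore =
    applied.pos < op.pos ||
    (applied.pos === op.pos && applied.userId < op.userId);
  return appliedIsBefore
    ? { ...op, pos: op.pos + applied.text.length }
    : op;
}

const base = 'The quick fox jumps.';
const aliceOp = { pos: 10, text: 'brown ', userId: 'alice' };
const bobOp = { pos: 19, text: ' high', userId: 'bob' }; // hypothetical edit

// Replica 1 applies Alice first, then Bob's transformed operation.
const replica1 = applyInsert(
  applyInsert(base, aliceOp),
  transformInsert(bobOp, aliceOp)
);

// Replica 2 applies Bob first, then Alice's transformed operation.
const replica2 = applyInsert(
  applyInsert(base, bobOp),
  transformInsert(aliceOp, bobOp)
);

console.log(replica1); // The quick brown fox jumps high.
console.log(replica2); // The quick brown fox jumps high.
```

Both replicas converge on the same text even though they applied the edits in opposite orders. Applying Bob's original position 19 after Alice's insert, without the transform, would have put " high" in the middle of "jumps".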

Daniel Hartwell

Author

Covers backend systems, distributed architecture, and database performance. Contributing author at NextGenBeing.
