WebSocket Implementation Guide: Real-Time Updates at Scale

Last year, our team at a SaaS company hit a wall. We had 50,000 concurrent users, and our polling-based notification system was hammering our database with 500,000 requests per minute. Our AWS bill was climbing toward $15K/month just for notification checks, and 90% of those requests returned nothing new.

I spent three weeks migrating us to WebSockets. We cut that bill to $9K/month, reduced notification latency from 30 seconds to under 200ms, and scaled to 500,000 concurrent connections. But I also made every mistake you can make with WebSockets, had two production outages, and learned things the documentation never mentions.

Here's what actually happens when you implement real-time updates at scale.

Why Our Polling System Failed (And Yours Probably Will Too)

Before I dive into WebSockets, let me show you exactly why we had to migrate. Our notification system used short polling - the client requested /api/notifications/check every 5 seconds. Simple, right?

Here's what we didn't anticipate:

// Our original polling approach - looks innocent
setInterval(async () => {
  const response = await fetch('/api/notifications/check');
  const data = await response.json();
  if (data.hasNew) {
    updateUI(data.notifications);
  }
}, 5000);

At 10,000 users, this created 120,000 requests per minute. Our database query looked like this:

SELECT * FROM notifications 
WHERE user_id = ? 
AND created_at > ? 
ORDER BY created_at DESC 
LIMIT 50;

With proper indexing, each query took about 15ms. But here's the problem - we were running this query 2,000 times per second during peak hours. Our database CPU spiked to 85%, and we started seeing connection pool exhaustion.

I added Redis caching:

// Server-side notification check with Redis
async function checkNotifications(userId, lastCheckTime) {
  const cacheKey = `notifications:${userId}:${lastCheckTime}`;
  
  // Try cache first
  let cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }
  
  // Cache miss - hit database
  const notifications = await db.query(
    'SELECT * FROM notifications WHERE user_id = ? AND created_at > ?',
    [userId, lastCheckTime]
  );
  
  // Cache for 5 seconds
  await redis.setex(cacheKey, 5, JSON.stringify(notifications));
  return notifications;
}

This helped, but we were still caching empty responses. At 50,000 users, we had 600,000 requests per minute, and 540,000 of them (90%) returned empty arrays. We were burning CPU cycles and network bandwidth to tell users "nothing new."

The real kicker? Our average notification latency was 2.5 seconds (half the polling interval), but could be as high as 5 seconds if a notification arrived right after a poll. Users complained that notifications felt "sluggish."

That's when my CTO Sarah said, "We need WebSockets."

The WebSocket Migration: What I Wish I'd Known

I started with Socket.IO because it seemed like the safe choice. The documentation made it look trivial:

// Server setup - looks so simple
const io = require('socket.io')(server);

io.on('connection', (socket) => {
  console.log('User connected:', socket.id);
  
  socket.on('disconnect', () => {
    console.log('User disconnected:', socket.id);
  });
});

// Client setup - also looks simple
const socket = io('https://api.example.com');

socket.on('notification', (data) => {
  showNotification(data);
});

I deployed this to staging on a Friday afternoon (first mistake). By Monday morning, I'd learned why WebSockets are harder than they look.

The Authentication Nightmare

Our API used JWT tokens in the Authorization header. WebSockets don't support custom headers during the initial handshake in browsers. The Socket.IO docs show this approach:

// What the docs show - doesn't work for us
const socket = io('https://api.example.com', {
  auth: {
    token: 'your-jwt-token'
  }
});

But we needed to validate tokens against our existing auth middleware. I tried passing the token as a query parameter:

// First attempt - security nightmare
const socket = io(`https://api.example.com?token=${jwtToken}`);

This worked, but tokens appeared in server logs, load balancer logs, and browser history. Our security audit flagged it immediately.

My colleague Jake suggested using Socket.IO's middleware system:

// Server-side auth middleware - what actually worked
io.use(async (socket, next) => {
  const token = socket.handshake.auth.token;
  
  if (!token) {
    return next(new Error('Authentication error'));
  }
  
  try {
    const decoded = jwt.verify(token, process.env.JWT_SECRET);
    
    // Verify token in database (check if revoked)
    const user = await db.query(
      'SELECT id, email FROM users WHERE id = ? AND token_revoked_at IS NULL',
      [decoded.userId]
    );
    
    if (!user) {
      return next(new Error('Invalid token'));
    }
    
    // Attach user to socket for later use
    socket.userId = user.id;
    socket.userEmail = user.email;
    next();
  } catch (err) {
    next(new Error('Authentication error'));
  }
});

This worked, but I discovered a new problem - token expiration. Our JWTs expired after 1 hour, but WebSocket connections could last for hours or days. Users would get disconnected randomly when their token expired.

I implemented token refresh:

// Client-side token refresh
let socket;
let tokenRefreshInterval;

function connectSocket(token) {
  socket = io('https://api.example.com', {
    auth: { token }
  });
  
  socket.on('connect_error', (err) => {
    if (err.message === 'Authentication error') {
      // Token expired - refresh it
      refreshTokenAndReconnect();
    }
  });
  
  // Refresh token every 45 minutes (before 1-hour expiry)
  tokenRefreshInterval = setInterval(async () => {
    const newToken = await refreshToken();
    
    // Reconnect with new token
    socket.disconnect();
    connectSocket(newToken);
  }, 45 * 60 * 1000);
}

async function refreshToken() {
  const response = await fetch('/api/auth/refresh', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${currentToken}`
    }
  });
  
  const data = await response.json();
  currentToken = data.token;
  localStorage.setItem('token', data.token);
  return data.token;
}

This worked, but caused a brief disconnection every 45 minutes. Users noticed. I needed a better approach.

Rooms and Namespaces: The Scaling Problem

With authentication working, I moved on to organizing connections. Socket.IO has two concepts - rooms and namespaces. I used rooms to group users:

// When user connects, join their personal room
io.on('connection', (socket) => {
  socket.join(`user:${socket.userId}`);
  
  // Also join any team rooms they're part of
  const teams = await getUserTeams(socket.userId);
  teams.forEach(team => {
    socket.join(`team:${team.id}`);
  });
});

Now I could send notifications to specific users or teams:

// Send notification to specific user
io.to(`user:${userId}`).emit('notification', {
  type: 'message',
  content: 'You have a new message'
});

// Send to entire team
io.to(`team:${teamId}`).emit('notification', {
  type: 'team_update',
  content: 'Project status changed'
});

This worked great in development. In production with 50,000 concurrent users, I discovered a critical issue - when a user was in 20 teams, they joined 21 rooms. With 50,000 users averaging 15 teams each, we had 800,000 room memberships.

Socket.IO stores rooms in memory. On our 4GB server, we were using 2.5GB just for room tracking. The server started swapping to disk, and latency spiked to 3+ seconds.

I needed to rethink our architecture.

The Production Architecture That Actually Scaled

After two weeks of testing, here's the architecture that got us to 500,000 concurrent connections:

Multi-Server Setup with Redis Adapter

Single-server WebSockets don't scale. We needed multiple servers with a shared state. Socket.IO's Redis adapter solved this:

// Server setup with Redis adapter
const { Server } = require('socket.io');
const { createAdapter } = require('@socket.io/redis-adapter');
const { createClient } = require('redis');

const io = new Server(server);

// Create Redis pub/sub clients
const pubClient = createClient({ 
  host: process.env.REDIS_HOST,
  port: 6379,
  password: process.env.REDIS_PASSWORD
});

const subClient = pubClient.duplicate();

Promise.all([pubClient.connect(), subClient.connect()]).then(() => {
  io.adapter(createAdapter(pubClient, subClient));
  console.log('Redis adapter connected');
});

Now when I emitted an event on Server A, it published to Redis, and Server B picked it up and sent it to its connected clients. Beautiful.

But I quickly hit Redis limits. With 100,000 concurrent connections across 10 servers, we were publishing 50,000 messages per second during peak hours. Redis CPU was at 90%.

I implemented message batching:

// Batch notifications before publishing
class NotificationBatcher {
  constructor(redis, batchInterval = 100) {
    this.redis = redis;
    this.batchInterval = batchInterval;
    this.batches = new Map();
    
    setInterval(() => this.flush(), batchInterval);
  }
  
  add(userId, notification) {
    if (!this.batches.has(userId)) {
      this.batches.set(userId, []);
    }
    this.batches.get(userId).push(notification);
  }
  
  async flush() {
    if (this.batches.size === 0) return;
    
    const pipeline = this.redis.pipeline();
    
    for (const [userId, notifications] of this.batches) {
      pipeline.publish(
        'notifications',
        JSON.stringify({
          userId,
          notifications
        })
      );
    }
    
    await pipeline.exec();
    this.batches.clear();
  }
}

const batcher = new NotificationBatcher(pubClient);

// Add notification to batch instead of immediate publish
function sendNotification(userId, notification) {
  batcher.add(userId, notification);
}

This reduced Redis messages by 80% and CPU usage dropped to 30%.

Connection Pooling and Load Balancing

We use HAProxy for load balancing. WebSockets need sticky sessions - you can't randomly distribute requests because the upgrade handshake must happen on the same server.

Here's our HAProxy config:

frontend websocket_frontend
    bind *:443 ssl crt /etc/ssl/certs/example.com.pem
    mode http
    
    # WebSocket detection
    acl is_websocket hdr(Upgrade) -i WebSocket
    acl is_websocket hdr_beg(Host) -i ws
    
    use_backend websocket_backend if is_websocket
    default_backend web_backend

backend websocket_backend
    mode http
    balance leastconn
    
    # Sticky sessions based on source IP
    stick-table type ip size 200k expire 30m
    stick on src
    
    # Health checks
    option httpchk GET /health
    
    # WebSocket-specific timeouts
    timeout tunnel 3600s
    timeout client 3600s
    timeout server 3600s
    
    server ws1 10.0.1.10:3000 check
    server ws2 10.0.1.11:3000 check
    server ws3 10.0.1.12:3000 check
    server ws4 10.0.1.13:3000 check

The balance leastconn directive sends new connections to the server with the fewest active connections. This gave us better distribution than round-robin.

But I discovered a problem during deployments. When we deployed a new version, HAProxy would drain connections from the old servers. With 50,000 connections per server and a 30-second drain timeout, we'd forcibly disconnect 40,000+ users.

I implemented graceful shutdown:

// Graceful shutdown handler
process.on('SIGTERM', async () => {
  console.log('SIGTERM received, starting graceful shutdown');
  
  // Stop accepting new connections
  server.close();
  
  // Notify all connected clients to reconnect
  io.emit('server_restart', {
    message: 'Server restarting, please reconnect',
    reconnectDelay: 5000
  });
  
  // Wait for clients to disconnect gracefully
  await new Promise(resolve => setTimeout(resolve, 10000));
  
  // Force disconnect any remaining clients
  io.close();
  
  // Close Redis connections
  await Promise.all([
    pubClient.quit(),
    subClient.quit()
  ]);
  
  process.exit(0);
});

Client-side reconnection logic:

socket.on('server_restart', (data) => {
  console.log('Server restarting:', data.message);
  
  // Disconnect and reconnect after delay
  socket.disconnect();
  
  setTimeout(() => {
    socket.connect();
  }, data.reconnectDelay);
});

// Auto-reconnect on disconnect
socket.on('disconnect', (reason) => {
  if (reason === 'io server disconnect') {
    // Server initiated disconnect - reconnect
    socket.connect();
  }
  // Socket.IO handles other disconnect reasons automatically
});

This reduced user-visible disconnections during deployments from 80% to under 5%.

Message Queue Integration: The Missing Piece

WebSockets solved real-time delivery, but we still needed to handle notification creation. Our application created notifications from multiple sources:

User actions (comments, mentions)
Background jobs (report generation, data imports)
External webhooks (payment confirmations, third-party integrations)
Scheduled tasks (reminders, digests)

I initially handled notifications directly in the application:

// Direct notification sending - doesn't scale
async function createNotification(userId, type, content) {
  // Save to database
  const notification = await db.query(
    'INSERT INTO notifications (user_id, type, content, created_at) VALUES (?, ?, ?, NOW())',
    [userId, type, content]
  );
  
  // Send via WebSocket
  io.to(`user:${userId}`).emit('notification', {
    id: notification.id,
    type,
    content,
    createdAt: new Date()
  });
}

This created a tight coupling between our application and WebSocket server. When the WebSocket server was down, notification creation failed. When we had a burst of notifications (like sending 10,000 notifications for a system-wide announcement), it blocked the application.

I moved to a queue-based approach using BullMQ:

// Queue-based notification system
const { Queue, Worker } = require('bullmq');

const notificationQueue = new Queue('notifications', {
  connection: {
    host: process.env.REDIS_HOST,
    port: 6379
  }
});

// Application adds jobs to queue
async function createNotification(userId, type, content) {
  await notificationQueue.add('send-notification', {
    userId,
    type,
    content,
    createdAt: new Date().toISOString()
  });
}

// Worker processes notifications
const worker = new Worker('notifications', async (job) => {
  const { userId, type, content, createdAt } = job.data;
  
  // Save to database
  const notification = await db.query(
    'INSERT INTO notifications (user_id, type, content, created_at) VALUES (?, ?, ?, ?)',
    [userId, type, content, createdAt]
  );
  
  // Send via WebSocket
  io.to(`user:${userId}`).emit('notification', {
    id: notification.id,
    type,
    content,
    createdAt
  });
}, {
  connection: {
    host: process.env.REDIS_HOST,
    port: 6379
  },
  concurrency: 50
});

This gave us several benefits:

Decoupling: Application and WebSocket server were independent
Reliability: If WebSocket server was down, notifications queued up
Rate limiting: We controlled notification throughput with concurrency settings
Retry logic: Failed notifications automatically retried

But I discovered a new problem - duplicate notifications. When a worker crashed mid-processing, BullMQ retried the job, and we'd send the same notification twice.

I implemented idempotency:

const worker = new Worker('notifications', async (job) => {
  const { userId, type, content, createdAt, idempotencyKey } = job.data;
  
  // Check if we've already processed this notification
  const existing = await redis.get(`notification:${idempotencyKey}`);
  if (existing) {
    console.log('Duplicate notification, skipping:', idempotencyKey);
    return;
  }
  
  // Save to database
  const notification = await db.query(
    'INSERT INTO notifications (user_id, type, content, created_at, idempotency_key) VALUES (?, ?, ?, ?, ?)',
    [userId, type, content, createdAt, idempotencyKey]
  );
  
  // Send via WebSocket
  io.to(`user:${userId}`).emit('notification', {
    id: notification.id,
    type,
    content,
    createdAt
  });
  
  // Mark as processed (expire after 24 hours)
  await redis.setex(`notification:${idempotencyKey}`, 86400, '1');
}, {
  connection: {
    host: process.env.REDIS_HOST,
    port: 6379
  },
  concurrency: 50
});

Now when creating a notification, we included an idempotency key:

async function createNotification(userId, type, content) {
  const idempotencyKey = `${userId}:${type}:${Date.now()}:${Math.random()}`;
  
  await notificationQueue.add('send-notification', {
    userId,
    type,
    content,
    createdAt: new Date().toISOString(),
    idempotencyKey
  });
}

This eliminated duplicate notifications completely.

Performance Optimization: Getting to 500K Connections

With the architecture working, I focused on performance. Our goal was 500,000 concurrent connections across 20 servers (25,000 per server).

Memory Optimization

Each WebSocket connection consumes memory.

Unlock Premium Content

You've read 30% of this article

What's in the full article

Complete step-by-step implementation guide
Working code examples you can copy-paste
Advanced techniques and pro tips
Common mistakes to avoid
Real-world examples and metrics

Don't have an account? Start your free trial

Join 10,000+ developers who love our premium content

Articles

Tutorials

Bloggers

Advanced Tutorial: Implementing Real-Time Updates with WebSockets

Listen to Article

Why Our Polling System Failed (And Yours Probably Will Too)

The WebSocket Migration: What I Wish I'd Known

The Authentication Nightmare

Rooms and Namespaces: The Scaling Problem

The Production Architecture That Actually Scaled

Multi-Server Setup with Redis Adapter

Connection Pooling and Load Balancing

Message Queue Integration: The Missing Piece

Performance Optimization: Getting to 500K Connections

Memory Optimization

Unlock Premium Content

What's in the full article

Keep reading

Mastering CI/CD Pipelines with Jenkins and Docker: A Deep Dive into Automated Deployment and Testing

Mastering Multi-Tenant SaaS Architecture Patterns for Enterprise Applications

Deep Dive into Next.js: Server-Side Rendering and Static Site Generation

Bekzod Erkinov

Get the AI-Assisted Developer's Field Guide

Comments (0)

Related Articles

Deep Dive into Next.js: Server-Side Rendering and Static Site Generation

Introduction to GraphQL: Benefits and Use Cases from Building Production APIs

Mastering CI/CD Pipelines with Jenkins and Docker: A Deep Dive into Automated Deployment and Testing

Before you go…

Articles

Tutorials

Bloggers

Advanced Tutorial: Implementing Real-Time Updates with WebSockets

Listen to Article

Why Our Polling System Failed (And Yours Probably Will Too)

The WebSocket Migration: What I Wish I'd Known

The Authentication Nightmare

Rooms and Namespaces: The Scaling Problem

The Production Architecture That Actually Scaled

Multi-Server Setup with Redis Adapter

Connection Pooling and Load Balancing

Message Queue Integration: The Missing Piece

Performance Optimization: Getting to 500K Connections

Memory Optimization

Unlock Premium Content

What's in the full article

Keep reading

Mastering CI/CD Pipelines with Jenkins and Docker: A Deep Dive into Automated Deployment and Testing

Mastering Multi-Tenant SaaS Architecture Patterns for Enterprise Applications

Deep Dive into Next.js: Server-Side Rendering and Static Site Generation

Bekzod Erkinov

Get the AI-Assisted Developer's Field Guide

Comments (0)

Related Articles

Deep Dive into Next.js: Server-Side Rendering and Static Site Generation

Introduction to GraphQL: Benefits and Use Cases from Building Production APIs

Mastering CI/CD Pipelines with Jenkins and Docker: A Deep Dive into Automated Deployment and Testing

Don't miss the next deep dive

Cookie & Ad Consent