Daniel Hartwell
Listen to Article
Loading...Last year, our team at a SaaS company hit a wall. We had 50,000 concurrent users, and our polling-based notification system was hammering our database with 500,000 requests per minute. Our AWS bill was climbing toward $15K/month just for notification checks, and 90% of those requests returned nothing new.
I spent three weeks migrating us to WebSockets. We cut that bill to $9K/month, reduced notification latency from 30 seconds to under 200ms, and scaled to 500,000 concurrent connections. But I also made every mistake you can make with WebSockets, had two production outages, and learned things the documentation never mentions.
Here's what actually happens when you implement real-time updates at scale.
Why Our Polling System Failed (And Yours Probably Will Too)
Before I dive into WebSockets, let me show you exactly why we had to migrate. Our notification system used short polling - the client requested /api/notifications/check every 5 seconds. Simple, right?
Here's what we didn't anticipate:
// Our original polling approach - looks innocent
setInterval(async () => {
const response = await fetch('/api/notifications/check');
const data = await response.json();
if (data.hasNew) {
updateUI(data.notifications);
}
}, 5000);
At 10,000 users, this created 120,000 requests per minute. Our database query looked like this:
SELECT * FROM notifications
WHERE user_id = ?
AND created_at > ?
ORDER BY created_at DESC
LIMIT 50;
With proper indexing, each query took about 15ms. But here's the problem - we were running this query 2,000 times per second during peak hours. Our database CPU spiked to 85%, and we started seeing connection pool exhaustion.
I added Redis caching:
// Server-side notification check with Redis
async function checkNotifications(userId, lastCheckTime) {
const cacheKey = `notifications:${userId}:${lastCheckTime}`;
// Try cache first
let cached = await redis.get(cacheKey);
if (cached) {
return JSON.parse(cached);
}
// Cache miss - hit database
const notifications = await db.query(
'SELECT * FROM notifications WHERE user_id = ? AND created_at > ?',
[userId, lastCheckTime]
);
// Cache for 5 seconds
await redis.setex(cacheKey, 5, JSON.stringify(notifications));
return notifications;
}
This helped, but we were still caching empty responses. At 50,000 users, we had 600,000 requests per minute, and 540,000 of them (90%) returned empty arrays. We were burning CPU cycles and network bandwidth to tell users "nothing new."
The real kicker? Our average notification latency was 2.5 seconds (half the polling interval), but could be as high as 5 seconds if a notification arrived right after a poll. Users complained that notifications felt "sluggish."
That's when my CTO Sarah said, "We need WebSockets."
The WebSocket Migration: What I Wish I'd Known
I started with Socket.IO because it seemed like the safe choice. The documentation made it look trivial:
// Server setup - looks so simple
const io = require('socket.io')(server);
io.on('connection', (socket) => {
console.log('User connected:', socket.id);
socket.on('disconnect', () => {
console.log('User disconnected:', socket.id);
});
});
// Client setup - also looks simple
const socket = io('https://api.example.com');
socket.on('notification', (data) => {
showNotification(data);
});
I deployed this to staging on a Friday afternoon (first mistake). By Monday morning, I'd learned why WebSockets are harder than they look.
The Authentication Nightmare
Our API used JWT tokens in the Authorization header. WebSockets don't support custom headers during the initial handshake in browsers. The Socket.IO docs show this approach:
// What the docs show - doesn't work for us
const socket = io('https://api.example.com', {
auth: {
token: 'your-jwt-token'
}
});
But we needed to validate tokens against our existing auth middleware. I tried passing the token as a query parameter:
// First attempt - security nightmare
const socket = io(`https://api.example.com?token=${jwtToken}`);
This worked, but tokens appeared in server logs, load balancer logs, and browser history. Our security audit flagged it immediately.
My colleague Jake suggested using Socket.IO's middleware system:
// Server-side auth middleware - what actually worked
io.use(async (socket, next) => {
const token = socket.handshake.auth.token;
if (!token) {
return next(new Error('Authentication error'));
}
try {
const decoded = jwt.verify(token, process.env.JWT_SECRET);
// Verify token in database (check if revoked)
const user = await db.query(
'SELECT id, email FROM users WHERE id = ? AND token_revoked_at IS NULL',
[decoded.userId]
);
if (!user) {
return next(new Error('Invalid token'));
}
// Attach user to socket for later use
socket.userId = user.id;
socket.userEmail = user.email;
next();
} catch (err) {
next(new Error('Authentication error'));
}
});
This worked, but I discovered a new problem - token expiration. Our JWTs expired after 1 hour, but WebSocket connections could last for hours or days. Users would get disconnected randomly when their token expired.
I implemented token refresh:
// Client-side token refresh
let socket;
let tokenRefreshInterval;
function connectSocket(token) {
socket = io('https://api.example.com', {
auth: { token }
});
socket.on('connect_error', (err) => {
if (err.message === 'Authentication error') {
// Token expired - refresh it
refreshTokenAndReconnect();
}
});
// Refresh token every 45 minutes (before 1-hour expiry)
tokenRefreshInterval = setInterval(async () => {
const newToken = await refreshToken();
// Reconnect with new token
socket.disconnect();
connectSocket(newToken);
}, 45 * 60 * 1000);
}
async function refreshToken() {
const response = await fetch('/api/auth/refresh', {
method: 'POST',
headers: {
'Authorization': `Bearer ${currentToken}`
}
});
const data = await response.json();
currentToken = data.token;
localStorage.setItem('token', data.token);
return data.token;
}
This worked, but caused a brief disconnection every 45 minutes. Users noticed. I needed a better approach.
Rooms and Namespaces: The Scaling Problem
With authentication working, I moved on to organizing connections. Socket.IO has two concepts - rooms and namespaces. I used rooms to group users:
// When user connects, join their personal room
io.on('connection', (socket) => {
socket.join(`user:${socket.userId}`);
// Also join any team rooms they're part of
const teams = await getUserTeams(socket.userId);
teams.forEach(team => {
socket.join(`team:${team.id}`);
});
});
Now I could send notifications to specific users or teams:
// Send notification to specific user
io.to(`user:${userId}`).emit('notification', {
type: 'message',
content: 'You have a new message'
});
// Send to entire team
io.to(`team:${teamId}`).emit('notification', {
type: 'team_update',
content: 'Project status changed'
});
This worked great in development. In production with 50,000 concurrent users, I discovered a critical issue - when a user was in 20 teams, they joined 21 rooms. With 50,000 users averaging 15 teams each, we had 800,000 room memberships.
Socket.IO stores rooms in memory. On our 4GB server, we were using 2.5GB just for room tracking. The server started swapping to disk, and latency spiked to 3+ seconds.
I needed to rethink our architecture.
The Production Architecture That Actually Scaled
After two weeks of testing, here's the architecture that got us to 500,000 concurrent connections:
Multi-Server Setup with Redis Adapter
Single-server WebSockets don't scale. We needed multiple servers with a shared state. Socket.IO's Redis adapter solved this:
// Server setup with Redis adapter
const { Server } = require('socket.io');
const { createAdapter } = require('@socket.io/redis-adapter');
const { createClient } = require('redis');
const io = new Server(server);
// Create Redis pub/sub clients
const pubClient = createClient({
host: process.env.REDIS_HOST,
port: 6379,
password: process.env.REDIS_PASSWORD
});
const subClient = pubClient.duplicate();
Promise.all([pubClient.connect(), subClient.connect()]).then(() => {
io.adapter(createAdapter(pubClient, subClient));
console.log('Redis adapter connected');
});
Now when I emitted an event on Server A, it published to Redis, and Server B picked it up and sent it to its connected clients. Beautiful.
But I quickly hit Redis limits. With 100,000 concurrent connections across 10 servers, we were publishing 50,000 messages per second during peak hours. Redis CPU was at 90%.
I implemented message batching:
// Batch notifications before publishing
class NotificationBatcher {
constructor(redis, batchInterval = 100) {
this.redis = redis;
this.batchInterval = batchInterval;
this.batches = new Map();
setInterval(() => this.flush(), batchInterval);
}
add(userId, notification) {
if (!this.batches.has(userId)) {
this.batches.set(userId, []);
}
this.batches.get(userId).push(notification);
}
async flush() {
if (this.batches.size === 0) return;
const pipeline = this.redis.pipeline();
for (const [userId, notifications] of this.batches) {
pipeline.publish(
'notifications',
JSON.stringify({
userId,
notifications
})
);
}
await pipeline.exec();
this.batches.clear();
}
}
const batcher = new NotificationBatcher(pubClient);
// Add notification to batch instead of immediate publish
function sendNotification(userId, notification) {
batcher.add(userId, notification);
}
This reduced Redis messages by 80% and CPU usage dropped to 30%.
Connection Pooling and Load Balancing
We use HAProxy for load balancing. WebSockets need sticky sessions - you can't randomly distribute requests because the upgrade handshake must happen on the same server.
Here's our HAProxy config:
frontend websocket_frontend
bind *:443 ssl crt /etc/ssl/certs/example.com.pem
mode http
# WebSocket detection
acl is_websocket hdr(Upgrade) -i WebSocket
acl is_websocket hdr_beg(Host) -i ws
use_backend websocket_backend if is_websocket
default_backend web_backend
backend websocket_backend
mode http
balance leastconn
# Sticky sessions based on source IP
stick-table type ip size 200k expire 30m
stick on src
# Health checks
option httpchk GET /health
# WebSocket-specific timeouts
timeout tunnel 3600s
timeout client 3600s
timeout server 3600s
server ws1 10.0.1.10:3000 check
server ws2 10.0.1.11:3000 check
server ws3 10.0.1.12:3000 check
server ws4 10.0.1.13:3000 check
The balance leastconn directive sends new connections to the server with the fewest active connections. This gave us better distribution than round-robin.
But I discovered a problem during deployments. When we deployed a new version, HAProxy would drain connections from the old servers. With 50,000 connections per server and a 30-second drain timeout, we'd forcibly disconnect 40,000+ users.
I implemented graceful shutdown:
// Graceful shutdown handler
process.on('SIGTERM', async () => {
console.log('SIGTERM received, starting graceful shutdown');
// Stop accepting new connections
server.close();
// Notify all connected clients to reconnect
io.emit('server_restart', {
message: 'Server restarting, please reconnect',
reconnectDelay: 5000
});
// Wait for clients to disconnect gracefully
await new Promise(resolve => setTimeout(resolve, 10000));
// Force disconnect any remaining clients
io.close();
// Close Redis connections
await Promise.all([
pubClient.quit(),
subClient.quit()
]);
process.exit(0);
});
Client-side reconnection logic:
socket.on('server_restart', (data) => {
console.log('Server restarting:', data.message);
// Disconnect and reconnect after delay
socket.disconnect();
setTimeout(() => {
socket.connect();
}, data.reconnectDelay);
});
// Auto-reconnect on disconnect
socket.on('disconnect', (reason) => {
if (reason === 'io server disconnect') {
// Server initiated disconnect - reconnect
socket.connect();
}
// Socket.IO handles other disconnect reasons automatically
});
This reduced user-visible disconnections during deployments from 80% to under 5%.
Message Queue Integration: The Missing Piece
WebSockets solved real-time delivery, but we still needed to handle notification creation. Our application created notifications from multiple sources:
- User actions (comments, mentions)
- Background jobs (report generation, data imports)
- External webhooks (payment confirmations, third-party integrations)
- Scheduled tasks (reminders, digests)
I initially handled notifications directly in the application:
// Direct notification sending - doesn't scale
async function createNotification(userId, type, content) {
// Save to database
const notification = await db.query(
'INSERT INTO notifications (user_id, type, content, created_at) VALUES (?, ?, ?, NOW())',
[userId, type, content]
);
// Send via WebSocket
io.to(`user:${userId}`).emit('notification', {
id: notification.id,
type,
content,
createdAt: new Date()
});
}
This created a tight coupling between our application and WebSocket server. When the WebSocket server was down, notification creation failed. When we had a burst of notifications (like sending 10,000 notifications for a system-wide announcement), it blocked the application.
I moved to a queue-based approach using BullMQ:
// Queue-based notification system
const { Queue, Worker } = require('bullmq');
const notificationQueue = new Queue('notifications', {
connection: {
host: process.env.REDIS_HOST,
port: 6379
}
});
// Application adds jobs to queue
async function createNotification(userId, type, content) {
await notificationQueue.add('send-notification', {
userId,
type,
content,
createdAt: new Date().toISOString()
});
}
// Worker processes notifications
const worker = new Worker('notifications', async (job) => {
const { userId, type, content, createdAt } = job.data;
// Save to database
const notification = await db.query(
'INSERT INTO notifications (user_id, type, content, created_at) VALUES (?, ?, ?, ?)',
[userId, type, content, createdAt]
);
// Send via WebSocket
io.to(`user:${userId}`).emit('notification', {
id: notification.id,
type,
content,
createdAt
});
}, {
connection: {
host: process.env.REDIS_HOST,
port: 6379
},
concurrency: 50
});
This gave us several benefits:
- Decoupling: Application and WebSocket server were independent
- Reliability: If WebSocket server was down, notifications queued up
- Rate limiting: We controlled notification throughput with concurrency settings
- Retry logic: Failed notifications automatically retried
But I discovered a new problem - duplicate notifications. When a worker crashed mid-processing, BullMQ retried the job, and we'd send the same notification twice.
I implemented idempotency:
const worker = new Worker('notifications', async (job) => {
const { userId, type, content, createdAt, idempotencyKey } = job.data;
// Check if we've already processed this notification
const existing = await redis.get(`notification:${idempotencyKey}`);
if (existing) {
console.log('Duplicate notification, skipping:', idempotencyKey);
return;
}
// Save to database
const notification = await db.query(
'INSERT INTO notifications (user_id, type, content, created_at, idempotency_key) VALUES (?, ?, ?, ?, ?)',
[userId, type, content, createdAt, idempotencyKey]
);
// Send via WebSocket
io.to(`user:${userId}`).emit('notification', {
id: notification.id,
type,
content,
createdAt
});
// Mark as processed (expire after 24 hours)
await redis.setex(`notification:${idempotencyKey}`, 86400, '1');
}, {
connection: {
host: process.env.REDIS_HOST,
port: 6379
},
concurrency: 50
});
Now when creating a notification, we included an idempotency key:
async function createNotification(userId, type, content) {
const idempotencyKey = `${userId}:${type}:${Date.now()}:${Math.random()}`;
await notificationQueue.add('send-notification', {
userId,
type,
content,
createdAt: new Date().toISOString(),
idempotencyKey
});
}
This eliminated duplicate notifications completely.
Performance Optimization: Getting to 500K Connections
With the architecture working, I focused on performance. Our goal was 500,000 concurrent connections across 20 servers (25,000 per server).
Memory Optimization
Each WebSocket connection consumes memory.
Unlock Premium Content
You've read 30% of this article
What's in the full article
- Complete step-by-step implementation guide
- Working code examples you can copy-paste
- Advanced techniques and pro tips
- Common mistakes to avoid
- Real-world examples and metrics
Don't have an account? Start your free trial
Join 10,000+ developers who love our premium content
Keep reading
Mastering CI/CD Pipelines with Jenkins and Docker: A Deep Dive into Automated Deployment and Testing
14 min · 240 views
Web DevelopmentMastering Multi-Tenant SaaS Architecture Patterns for Enterprise Applications
27 min · 200 views
PerformanceDatabase Optimization Techniques for High-Traffic Websites
24 min · 78 views
Daniel Hartwell
AuthorCovers backend systems, distributed architecture, and database performance. Contributing author at NextGenBeing.
Never Miss an Article
Get our best content delivered to your inbox weekly. No spam, unsubscribe anytime.
Comments (0)
Please log in to leave a comment.
Log InRelated Articles
Mastering CI/CD Pipelines with Jenkins and Docker: A Deep Dive into Automated Deployment and Testing
Mar 31, 2026
Code Quality in Production: Hard Lessons from Scaling to 50 Million Requests
May 1, 2026
AWS vs Azure vs Google Cloud: What 3 Years of Multi-Cloud Architecture Taught Me About Choosing the Right Provider
Apr 26, 2026