Last year, our team at a SaaS company hit a wall. We had 50,000 concurrent users, and our polling-based notification system was hammering our database with 500,000 requests per minute. Our AWS bill was climbing toward $15K/month just for notification checks, and 90% of those requests returned nothing new.
I spent three weeks migrating us to WebSockets. We cut that bill to $9K/month, reduced notification latency from 30 seconds to under 200ms, and scaled to 500,000 concurrent connections. But I also made every mistake you can make with WebSockets, had two production outages, and learned things the documentation never mentions.
Here's what actually happens when you implement real-time updates at scale.
Why Our Polling System Failed (And Yours Probably Will Too)
Before I dive into WebSockets, let me show you exactly why we had to migrate. Our notification system used short polling - the client requested /api/notifications/check every 5 seconds. Simple, right?
Here's what we didn't anticipate:
// Our original polling approach - looks innocent
setInterval(async () => {
  const response = await fetch('/api/notifications/check');
  const data = await response.json();
  if (data.hasNew) {
    updateUI(data.notifications);
  }
}, 5000);
At 10,000 users, this created 120,000 requests per minute. Our database query looked like this:
SELECT * FROM notifications
WHERE user_id = ?
AND created_at > ?
ORDER BY created_at DESC
LIMIT 50;
With proper indexing, each query took about 15ms. But here's the problem - we were running this query 2,000 times per second during peak hours. Our database CPU spiked to 85%, and we started seeing connection pool exhaustion.
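The load numbers above follow directly from the polling interval. A quick back-of-the-envelope helper (written here for illustration, not from our codebase) shows why the database never stood a chance:

```javascript
// Rough polling-load math: each user fires one request every `intervalSec` seconds.
function pollingLoad(users, intervalSec) {
  const requestsPerSecond = users / intervalSec;
  return {
    requestsPerSecond,
    requestsPerMinute: requestsPerSecond * 60,
  };
}

// 10,000 users polling every 5s -> 2,000 queries/sec, 120,000 requests/min
console.log(pollingLoad(10000, 5));
```

Note that the load scales linearly with users and inversely with the interval: the only polling knobs you have are "fewer users" or "slower notifications."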
I added Redis caching:
// Server-side notification check with Redis
async function checkNotifications(userId, lastCheckTime) {
  const cacheKey = `notifications:${userId}:${lastCheckTime}`;

  // Try cache first
  let cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // Cache miss - hit database
  const notifications = await db.query(
    'SELECT * FROM notifications WHERE user_id = ? AND created_at > ?',
    [userId, lastCheckTime]
  );

  // Cache for 5 seconds
  await redis.setex(cacheKey, 5, JSON.stringify(notifications));
  return notifications;
}
This helped, but we were still caching empty responses. At 50,000 users, we had 600,000 requests per minute, and 540,000 of them (90%) returned empty arrays. We were burning CPU cycles and network bandwidth to tell users "nothing new."
The real kicker? Our average notification latency was 2.5 seconds (half the polling interval), but could be as high as 5 seconds if a notification arrived right after a poll. Users complained that notifications felt "sluggish."
That's when my CTO Sarah said, "We need WebSockets."
The WebSocket Migration: What I Wish I'd Known
I started with Socket.IO because it seemed like the safe choice. The documentation made it look trivial:
// Server setup - looks so simple
const io = require('socket.io')(server);

io.on('connection', (socket) => {
  console.log('User connected:', socket.id);

  socket.on('disconnect', () => {
    console.log('User disconnected:', socket.id);
  });
});

// Client setup - also looks simple
const socket = io('https://api.example.com');

socket.on('notification', (data) => {
  showNotification(data);
});
I deployed this to staging on a Friday afternoon (first mistake). By Monday morning, I'd learned why WebSockets are harder than they look.
The Authentication Nightmare
Our API used JWT tokens in the Authorization header. WebSockets don't support custom headers during the initial handshake in browsers. The Socket.IO docs show this approach:
// What the docs show - but the server still has to validate this token
const socket = io('https://api.example.com', {
  auth: {
    token: 'your-jwt-token'
  }
});
But we needed to validate tokens against our existing auth middleware. I tried passing the token as a query parameter:
// First attempt - security nightmare
const socket = io(`https://api.example.com?token=${jwtToken}`);
This worked, but tokens appeared in server logs, load balancer logs, and browser history. Our security audit flagged it immediately.
My colleague Jake suggested using Socket.IO's middleware system:
// Server-side auth middleware - what actually worked
io.use(async (socket, next) => {
  const token = socket.handshake.auth.token;
  if (!token) {
    return next(new Error('Authentication error'));
  }

  try {
    const decoded = jwt.verify(token, process.env.JWT_SECRET);

    // Verify token in database (check if revoked); query returns a row array
    const [user] = await db.query(
      'SELECT id, email FROM users WHERE id = ? AND token_revoked_at IS NULL',
      [decoded.userId]
    );
    if (!user) {
      return next(new Error('Invalid token'));
    }

    // Attach user to socket for later use
    socket.userId = user.id;
    socket.userEmail = user.email;
    next();
  } catch (err) {
    next(new Error('Authentication error'));
  }
});
This worked, but I discovered a new problem - token expiration. Our JWTs expired after 1 hour, but WebSocket connections could last for hours or days. Users would get disconnected randomly when their token expired.
I implemented token refresh:
// Client-side token refresh
let socket;
let tokenRefreshInterval;

function connectSocket(token) {
  socket = io('https://api.example.com', {
    auth: { token }
  });

  socket.on('connect_error', (err) => {
    if (err.message === 'Authentication error') {
      // Token expired - refresh it
      refreshTokenAndReconnect();
    }
  });

  // Refresh token every 45 minutes (before 1-hour expiry)
  tokenRefreshInterval = setInterval(async () => {
    const newToken = await refreshToken();
    // Reconnect with the new token; Socket.IO sends socket.auth on reconnect
    socket.auth = { token: newToken };
    socket.disconnect().connect();
  }, 45 * 60 * 1000);
}