Last year, our authentication system fell apart at 2am on a Tuesday. We'd just crossed 10 million active users, and suddenly our Redis session store was maxed out, login requests were timing out, and our support queue exploded with angry customers who couldn't access their accounts. I was the lead backend engineer, and I'd made every classic mistake in the book.
That incident forced us to completely rethink how we handle authentication. Over the next six months, we rebuilt our auth system from scratch, scaling it to handle 50 million users with sub-200ms login times. We survived credential stuffing attacks, implemented proper MFA, discovered why our JWT implementation was leaking memory, and learned that most "best practices" articles skip the hard parts.
Here's what actually works in production, the mistakes that cost us sleep, and the patterns we wish we'd known from day one.
The Authentication Landscape Nobody Talks About
When I started rebuilding our auth system, I thought I knew what I was doing. I'd read the OAuth2 spec, understood JWT tokens, and had implemented login systems at three previous companies. But production auth at scale is completely different from what you build for a 10,000-user SaaS app.
The first thing that surprised me: authentication isn't one problem, it's seven interconnected problems that all fail in different ways. You've got initial authentication (username/password), session management, token refresh, MFA verification, authorization checks, account recovery, and session invalidation. Each one has different performance characteristics, different security requirements, and different failure modes.
Our original system used JWT tokens stored in localStorage, with a 24-hour expiration and refresh tokens in httpOnly cookies. Sounds reasonable, right? That's what half the tutorials recommend. Here's what we discovered the hard way:
The localStorage JWT pattern breaks down at scale for three reasons nobody mentions:
First, you can't invalidate JWTs without maintaining a blocklist, which defeats the entire point of stateless tokens. When we needed to force-logout compromised accounts during a security incident, we had no way to do it. We ended up maintaining a Redis blocklist anyway, which meant we were making a database call on every request - exactly what JWTs were supposed to avoid.
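The workaround, for the record, looked roughly like this. It's a sketch rather than our exact code - it assumes the jsonwebtoken package, a jti claim on every token, and the same promise-based redis client used throughout this article:
// The blocklist check we bolted onto our "stateless" JWTs (sketch)
const jwt = require('jsonwebtoken');
async function verifyRequestToken(token) {
  // Signature and expiry checks are still local and fast
  const payload = jwt.verify(token, JWT_SECRET);
  // ...but revocation requires a Redis hop on every single request
  if (await redis.get(`jwt:blocklist:${payload.jti}`)) {
    throw new AuthenticationError('Token has been revoked');
  }
  return payload;
}
// Force-logout: park the token on the blocklist until it would have expired anyway
async function revokeToken(payload) {
  const ttl = payload.exp - Math.floor(Date.now() / 1000);
  if (ttl > 0) {
    await redis.setex(`jwt:blocklist:${payload.jti}`, ttl, '1');
  }
}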
Second, JWT payload size matters far more than you'd think. We started stuffing user permissions, organization IDs, and feature flags into the JWT to avoid database lookups. Seemed smart. But once your token grows to 2KB, you're sending those 2KB on every single API request. At 50 million requests per day, that's 100GB of bandwidth per day just for tokens. We were literally paying AWS hundreds of dollars per month to transfer JWT tokens.
Third, and this one really got us: browser localStorage isn't available in web workers or service workers. When we added offline support and background sync, we couldn't access the auth token from our service worker. We had to refactor everything.
My colleague Sarah, our security lead, pushed for moving to httpOnly cookies for session tokens instead. I resisted at first because "JWTs are more modern" and "cookies are legacy." I was wrong. Here's why:
Cookies with proper security flags (httpOnly, secure, sameSite=strict) are significantly more secure than localStorage for session tokens. XSS attacks can't steal them. CSRF attacks are mitigated by sameSite policies. And you can invalidate them server-side by deleting the session.
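Wiring that up in Express is a few lines. Here's a sketch - the cookie name and TTL are illustrative, and authenticate and createSession are hypothetical helpers standing in for the credential check and session creation:
// Issuing the session cookie with the flags above (sketch)
app.post('/login', async (req, res) => {
  const user = await authenticate(req.body.username, req.body.password); // hypothetical helper
  const sessionId = await createSession(user); // hypothetical helper
  res.cookie('sid', sessionId, {
    httpOnly: true,                   // invisible to JavaScript, so XSS can't read it
    secure: true,                     // only sent over HTTPS
    sameSite: 'strict',               // withheld from cross-site requests, blunting CSRF
    maxAge: 30 * 24 * 60 * 60 * 1000  // 30 days, matching the session TTL
  });
  res.sendStatus(204);
});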
But cookies have their own gotchas. We learned this when our mobile app team tried to integrate with our cookie-based auth. Mobile apps don't handle cookies the same way browsers do. We ended up supporting both patterns: httpOnly cookies for web clients, and short-lived JWTs with refresh tokens for mobile clients.
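For the mobile side, the shape is the standard pair: a short-lived JWT plus an opaque refresh token stored server-side so it can be revoked. A sketch - the 15-minute and 30-day lifetimes are illustrative, not our exact numbers:
// Token pair for mobile clients (sketch)
const crypto = require('crypto');
const jwt = require('jsonwebtoken');
async function issueMobileTokens(userId) {
  // Short-lived access token: a leaked token has a small blast radius
  const accessToken = jwt.sign({ sub: userId }, JWT_SECRET, { expiresIn: '15m' });
  // Opaque refresh token: random bytes, stored in Redis so we can revoke it
  const refreshToken = crypto.randomBytes(32).toString('hex');
  await redis.setex(`refresh:${refreshToken}`, 30 * 24 * 3600, userId);
  return { accessToken, refreshToken };
}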
Session Management: The Part That Actually Breaks
Here's the thing about session management: it works perfectly until you have millions of concurrent sessions, then everything falls apart in ways you didn't anticipate.
Our first session implementation used Redis as a session store. Simple key-value storage: session ID maps to user data. We set a 30-day TTL on sessions and called it done. This worked great for six months, until we hit about 5 million active users.
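The whole v1 write path was a few lines - a sketch of that createSession, with crypto.randomUUID standing in for whatever session ID generation you prefer:
// v1 session creation (sketch): one SETEX, 30-day TTL
const crypto = require('crypto');
const SESSION_TTL = 30 * 24 * 60 * 60; // 30 days, in seconds
async function createSession(user) {
  const sessionId = crypto.randomUUID();
  await redis.setex(`session:${sessionId}`, SESSION_TTL, JSON.stringify(user));
  return sessionId;
}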
The problem wasn't Redis performance - Redis can handle millions of operations per second. The problem was session data size and memory usage. Here's what we were storing per session:
// Our original session structure (DON'T DO THIS)
{
  userId: "uuid",
  email: "user@example.com",
  firstName: "John",
  lastName: "Doe",
  organizationId: "org-uuid",
  permissions: ["read:users", "write:posts", "admin:billing", ...], // 50+ permissions
  preferences: { theme: "dark", language: "en", ... },
  lastActivity: 1704067200,
  loginHistory: [ /* last 10 logins */ ],
  featureFlags: { newUI: true, betaFeatures: false, ... }
}
Each session was about 4KB. With 10 million concurrent sessions, that's 40GB of RAM just for session data. Our Redis instance was constantly hitting memory limits, and we were paying $800/month for a massive Redis cluster.
Jake, one of our senior engineers, suggested we were thinking about sessions wrong. Instead of storing everything in the session, we should only store the minimum needed for authentication and authorization, then lazy-load everything else.
Here's what we moved to:
// Optimized session structure
{
  userId: "uuid",
  sessionId: "session-uuid",
  createdAt: 1704067200,
  lastActivity: 1704067200,
  ipAddress: "192.168.1.1",
  userAgent: "Mozilla/5.0...",
  mfaVerified: true
}
Everything else gets loaded from the database on-demand and cached with a short TTL. This reduced our session size to about 200 bytes. Same 10 million sessions now take 2GB instead of 40GB. We dropped our Redis costs by 85%.
But here's the gotcha: now we're making more database queries. We had to be smart about caching. Here's the pattern we settled on:
// Session validation with smart caching
async function validateSession(sessionId) {
  // 1. Check the session exists in Redis (~200 bytes)
  const rawSession = await redis.get(`session:${sessionId}`);
  if (!rawSession) {
    throw new AuthenticationError('Invalid session');
  }
  const session = JSON.parse(rawSession);

  // 2. Check if user data is in cache
  let userData = await redis.get(`user:${session.userId}`);
  if (userData) {
    userData = JSON.parse(userData);
  } else {
    // 3. Load from the database and cache for 5 minutes
    userData = await db.users.findById(session.userId);
    await redis.setex(`user:${session.userId}`, 300, JSON.stringify(userData));
  }

  // 4. Update last activity (async, don't block the response)
  redis.setex(`session:${sessionId}`, SESSION_TTL, JSON.stringify({
    ...session,
    lastActivity: Date.now()
  })).catch(err => logger.error('Failed to update session activity', err));

  return { session, user: userData };
}
This pattern gave us sub-50ms session validation times even under heavy load. The key insight: session validation is on the hot path for every authenticated request, so it needs to be fast. Everything else can be lazy-loaded.
Password Security: Beyond "Just Use Bcrypt"
Everyone says "use bcrypt for password hashing," but nobody talks about the production realities. A credential stuffing attack last year taught us some expensive lessons.
First, the attack itself: over a 48-hour period, attackers tried 2 million username/password combinations against our login endpoint. These were credentials leaked from other services - the classic credential stuffing pattern where users reuse passwords across sites.
Our initial defense was rate limiting, which helped but wasn't enough. Here's what we had:
// Rate limiting v1 (INSUFFICIENT)
const loginLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 5, // 5 requests per window
  keyGenerator: (req) => req.ip
});

app.post('/login', loginLimiter, async (req, res) => {
  // login logic
});
The attackers were using a botnet with thousands of different IP addresses, so IP-based rate limiting barely slowed them down. We were still processing 200+ login attempts per second, and our bcrypt hashing was crushing our CPU.
Here's what nobody tells you about bcrypt: it's intentionally slow, which is great for security but terrible for performance under attack. We were using a cost factor of 12, which meant each password hash took about 300ms of CPU time. At 200 requests/second, we needed 60 CPU cores just to handle password hashing during the attack.
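If you've never measured this yourself, it's worth benchmarking on your own hardware. A quick sketch with the bcrypt npm package - each +1 to the cost factor roughly doubles the time:
// Quick benchmark: how long does each bcrypt cost factor take on this box?
const bcrypt = require('bcrypt');
async function benchmarkBcrypt() {
  for (const cost of [10, 11, 12]) {
    const start = process.hrtime.bigint();
    await bcrypt.hash('correct horse battery staple', cost);
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`cost=${cost}: ${ms.toFixed(0)}ms`);
  }
}
benchmarkBcrypt();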
Our servers were melting. Response times for legitimate users shot up to 5+ seconds because all our CPU was busy hashing passwords for attackers.
Sarah came up with a multi-layered defense that actually worked:
Layer 1: Adaptive rate limiting based on multiple signals
// Rate limiting v2 (BETTER)
async function checkRateLimit(req) {
  const ip = req.ip;
  const username = req.body.username;
  const fingerprint = req.headers['x-device-fingerprint'];

  // Check multiple buckets
  const limits = await Promise.all([
    redis.get(`login:ip:${ip}`),
    redis.get(`login:username:${username}`),
    redis.get(`login:fingerprint:${fingerprint}`),
    redis.get(`login:global`)
  ]);
  const [ipCount, usernameCount, fingerprintCount, globalCount] = limits.map(Number);

  // IP: 10 attempts per 15 minutes
  if (ipCount > 10) {
    throw new RateLimitError('Too many login attempts from this IP');
  }

  // Username: 5 attempts per 15 minutes (throttles attacks targeting a single account)
  if (usernameCount > 5) {
    throw new RateLimitError('Too many login attempts for this account');
  }

  // Device fingerprint: 8 attempts per 15 minutes
  if (fingerprintCount > 8) {
    throw new RateLimitError('Too many login attempts from this device');
  }

  // Global: 1000 attempts per minute (circuit breaker)
  if (globalCount > 1000) {
    throw new RateLimitError('System is under heavy load, please try again');
  }

  // Increment all counters in a single round trip
  const pipeline = redis.pipeline();
  pipeline.incr(`login:ip:${ip}`);
  pipeline.expire(`login:ip:${ip}`, 900);
  pipeline.incr(`login:username:${username}`);
  pipeline.expire(`login:username:${username}`, 900);
  pipeline.incr(`login:fingerprint:${fingerprint}`);
  pipeline.expire(`login:fingerprint:${fingerprint}`, 900);
  pipeline.incr(`login:global`);
  pipeline.expire(`login:global`, 60);
  await pipeline.exec();
}
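Wiring the check into the login route is straightforward. A simplified sketch, with authenticate again standing in for the actual credential check:
// Using the multi-signal check in the login route (sketch)
app.post('/login', async (req, res) => {
  try {
    await checkRateLimit(req); // cheap Redis reads before any bcrypt work
    const user = await authenticate(req.body.username, req.body.password); // hypothetical helper
    // ...create the session, set the cookie, etc.
    res.sendStatus(204);
  } catch (err) {
    if (err instanceof RateLimitError) {
      return res.status(429).json({ error: err.message });
    }
    // Identical response for unknown username and wrong password
    res.status(401).json({ error: 'Invalid credentials' });
  }
});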
This cut the attack traffic by 80% before it even hit our password hashing logic.
Layer 2: Adaptive bcrypt cost factor
Here's something I haven't seen documented anywhere: you can dynamically adjust your bcrypt cost factor based on system load. During normal operation, we use cost factor 12 for maximum security. Under attack, we temporarily drop to cost factor 10.
// Adaptive bcrypt cost
async function hashPassword(password) {
  const currentLoad = await getSystemLoad();
  const cost = currentLoad > 0.8 ? 10 : 12; // Drop cost under heavy load
  return bcrypt.hash(password, cost);
}

async function verifyPassword(password, hash) {
  const currentLoad = await getSystemLoad();
  // Under heavy load, add an artificial delay before checking
  if (currentLoad > 0.9) {
    await new Promise(resolve => setTimeout(resolve, 500));
  }
  return bcrypt.compare(password, hash);
}
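getSystemLoad is just a thin helper; a minimal version using Node's os module might look like this - the 1-minute load average normalized by core count:
// One way to implement getSystemLoad (sketch)
const os = require('os');
async function getSystemLoad() {
  const [oneMinute] = os.loadavg();     // average runnable processes over the last minute
  return oneMinute / os.cpus().length;  // ~1.0 means every core is busy
}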
This is controversial - you're technically weakening security under load. But here's the reality: cost factor 10 is still strong (2^10 = 1,024 iterations), and it's better than your entire service being down. One subtlety worth spelling out: the lower cost only applies to hashes written while the system is under load (signups and password changes), because bcrypt.compare always runs at the cost factor embedded in the stored hash - that's why verifyPassword throttles with a delay instead of a cheaper hash. We only do this when CPU usage is above 80%, and we log every instance for security review.
Layer 3: Progressive delays for failed attempts
After a failed login, we add an increasing delay before allowing the next attempt from that username:
async function handleFailedLogin(username) {
  const failureKey = `login:failures:${username}`;
  const failures = await redis.incr(failureKey);
  await redis.expire(failureKey, 3600); // Reset after 1 hour

  // Progressive delays: 0s, 1s, 2s, 4s, 8s, 16s, 30s (max)
  // Exponential backoff capped at 30s - one way to produce that schedule
  const delay = failures <= 1
    ? 0
    : Math.min(Math.pow(2, failures - 2) * 1000, 30000);
  await new Promise(resolve => setTimeout(resolve, delay));
}