Daniel Hartwell
Listen to Article
Loading...Last March, our platform hit a wall at 47 million queries per day. Our PostgreSQL database was melting down. Average response times had crept from 800ms to 4.2 seconds. Users were complaining. Our infrastructure costs had ballooned to $142k/month just for database instances. My CTO Sarah pulled me into a conference room and said, "We need to fix this in two weeks, or we're looking at a complete architecture rewrite."
I'd been at the company for three years, and I thought I knew our database pretty well. Turns out, I knew almost nothing about how it actually performed under real load. What followed was an intense two-week sprint that taught me more about database optimization than my previous five years of development combined.
Here's what we learned about indexing and caching when your back is against the wall and every millisecond counts.
The Performance Crisis Nobody Saw Coming
We'd grown from 10,000 to 500,000 active users in eight months. Our application was a SaaS analytics platform that processed customer data in real-time. Every user action triggered multiple database queries—sometimes 15-20 queries per page load.
The warning signs were there, but we'd ignored them. Our monitoring showed query times slowly climbing, but we kept saying "we'll optimize later." That's the mistake everyone makes. Later becomes never until it becomes a crisis.
When we finally dug into the data, the numbers were brutal:
- 47 million queries per day
- Average query time: 4.2 seconds
- 95th percentile: 12.8 seconds
- Peak hour queries timing out completely
- Database CPU consistently above 85%
- Connection pool exhausted during traffic spikes
Our PostgreSQL instance was a db.r5.4xlarge on AWS (16 vCPUs, 128GB RAM), and we were already discussing upgrading to a db.r5.8xlarge at $6,800/month. That's when I realized we were trying to solve a software problem with hardware.
What the Monitoring Data Actually Revealed
I spent the first two days just understanding what was happening. We used pgBadger to analyze our PostgreSQL logs, and the results were eye-opening.
The top 10 slowest queries accounted for 73% of our total database time. Let me show you the actual query that was killing us:
SELECT
u.id, u.email, u.name, u.created_at,
p.plan_name, p.price,
COUNT(e.id) as event_count,
MAX(e.created_at) as last_event
FROM users u
LEFT JOIN subscriptions s ON u.id = s.user_id
LEFT JOIN plans p ON s.plan_id = p.id
LEFT JOIN events e ON u.id = e.user_id
WHERE u.company_id = $1
AND u.status = 'active'
AND e.created_at >= NOW() - INTERVAL '30 days'
GROUP BY u.id, u.email, u.name, u.created_at, p.plan_name, p.price
ORDER BY event_count DESC
LIMIT 50;
This query ran on our dashboard page. Every single time someone loaded their dashboard, this query executed. With 500k active users, that's a lot of dashboard loads.
Running EXPLAIN ANALYZE showed the horror:
Planning Time: 2.347 ms
Execution Time: 4,234.891 ms
Seq Scan on events e (cost=0.00..892347.23 rows=1847293 width=16)
Filter: (created_at >= (now() - '30 days'::interval))
Rows Removed by Filter: 38472934
We were doing a sequential scan on a table with 40 million rows. Every. Single. Time.
The Indexing Strategy That Changed Everything
I'll be honest—I thought I understood indexes. I knew you put them on columns you query frequently. What I didn't understand was the nuance of composite indexes, index ordering, partial indexes, and how PostgreSQL actually uses them.
My colleague Jake, who'd worked at a high-frequency trading firm, sat down with me and explained what I was missing.
Composite Indexes: Order Matters More Than You Think
The first thing Jake pointed out was our index on the events table:
CREATE INDEX idx_events_user_id ON events(user_id);
CREATE INDEX idx_events_created_at ON events(created_at);
"You've got two separate indexes," he said. "PostgreSQL might use one, but it can't efficiently use both together for this query."
We needed a composite index, but the order mattered. Here's what we created:
CREATE INDEX idx_events_user_created
ON events(user_id, created_at DESC)
WHERE created_at >= NOW() - INTERVAL '90 days';
Notice three things:
-
Column order:
user_idfirst, thencreated_at. We filter by user_id and then need to find recent events for that user. -
DESC ordering: We're ordering by event_count DESC in the query, and this index helps with that.
-
Partial index: We only care about events from the last 90 days. Why index historical data we never query?
This one index reduced our query time from 4.2 seconds to 380ms. A 91% improvement from a single CREATE INDEX statement.
But we weren't done.
The Covering Index Trick Nobody Talks About
The query was still doing a lot of work. After using the index to find the right rows, PostgreSQL had to go back to the table to fetch the actual data (a "heap fetch"). With millions of rows, that's expensive.
Jake introduced me to covering indexes—indexes that include all the columns you need, so PostgreSQL never has to touch the main table:
CREATE INDEX idx_events_user_created_covering
ON events(user_id, created_at DESC)
INCLUDE (id)
WHERE created_at >= NOW() - INTERVAL '90 days';
The INCLUDE clause adds columns to the index without making them part of the index key. This means PostgreSQL can satisfy the entire query from the index alone.
Query time dropped to 180ms. We'd gone from 4.2 seconds to 180ms with strategic indexing.
The Index Maintenance Problem We Discovered Later
Three weeks after deploying our new indexes, we noticed something weird. Our write performance had degraded. INSERT and UPDATE operations on the events table were taking 40% longer.
This is the trade-off nobody mentions in the tutorials. Every index you add makes writes slower because PostgreSQL has to update the index on every INSERT, UPDATE, or DELETE.
We had created seven indexes on the events table, and each one added overhead. We ran this query to see the impact:
SELECT
schemaname,
tablename,
indexname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
AND tablename = 'events'
ORDER BY idx_scan;
Output:
indexname | idx_scan | index_size
-----------------------------------+----------+-----------
idx_events_status | 147 | 892 MB
idx_events_type | 892 | 1.2 GB
idx_events_user_created_covering | 8472934 | 3.4 GB
We had two indexes that were barely used but consuming 2.1GB and slowing down every write. We dropped them immediately.
Lesson learned: Monitor your indexes. Use pg_stat_user_indexes to find unused indexes and drop them ruthlessly.
Partial Indexes for Specific Query Patterns
One of our most common queries was finding active users who had logged in within the last 7 days:
SELECT * FROM users
WHERE status = 'active'
AND last_login_at >= NOW() - INTERVAL '7 days';
We had 2.3 million total users, but only 480,000 were active. Creating a full index on status and last_login_at would index 1.8 million rows we never query.
Instead, we used a partial index:
CREATE INDEX idx_users_active_recent_login
ON users(last_login_at DESC)
WHERE status = 'active';
This index only includes active users, making it 76% smaller and much faster to scan. The WHERE clause in the index definition tells PostgreSQL to only index rows that match that condition.
Query time went from 340ms to 45ms, and the index was only 180MB instead of 750MB.
The Expression Index That Saved Our Search Feature
Our search feature was a disaster. Users could search for other users by email, and we had this query:
SELECT * FROM users
WHERE LOWER(email) LIKE LOWER($1 || '%');
We had an index on email, but PostgreSQL couldn't use it because of the LOWER() function. Sequential scans on 2.3 million rows meant 2-3 second search times.
The solution was an expression index:
CREATE INDEX idx_users_email_lower ON users(LOWER(email) text_pattern_ops);
Two things here:
- Expression index: We index the result of LOWER(email), not the raw column
- text_pattern_ops: This operator class is optimized for LIKE queries with a prefix pattern
Search queries dropped to 12ms. Users were happy again.
When Indexes Aren't Enough: Enter Caching
Even with perfect indexes, we were still hitting the database 47 million times per day. Each query was fast, but the sheer volume was keeping our database CPU at 60-70%.
We needed caching, but I'd tried caching before and it had been a nightmare. Cache invalidation is famously one of the two hard problems in computer science (along with naming things and off-by-one errors).
The Multi-Layer Caching Strategy We Implemented
We implemented three caching layers, each with a specific purpose:
Layer 1: Application-Level Object Cache (Redis)
- Cached complete objects (users, plans, etc.)
- TTL: 5-15 minutes depending on data volatility
- Invalidated on writes
Layer 2: Query Result Cache (Redis)
- Cached expensive query results
- TTL: 1-5 minutes
- Tagged for bulk invalidation
Layer 3: HTTP Response Cache (Varnish)
- Cached complete HTTP responses
- TTL: 30-60 seconds
- Edge-side includes for personalized content
Let me show you how we implemented each layer and what we learned.
Layer 1: Object Caching with Redis
For frequently accessed objects that rarely change, we used a simple cache-aside pattern:
class UserRepository
{
private $redis;
private $db;
public function find($userId)
{
$cacheKey = "user:{$userId}";
// Try cache first
$cached = $this->redis->get($cacheKey);
if ($cached !== null) {
return json_decode($cached, true);
}
// Cache miss - fetch from database
$user = $this->db->query(
"SELECT * FROM users WHERE id = ?",
[$userId]
)->fetch();
if ($user) {
// Store in cache for 15 minutes
$this->redis->setex(
$cacheKey,
900,
json_encode($user)
);
}
return $user;
}
public function update($userId, $data)
{
// Update database
$this->db->query(
"UPDATE users SET name = ?, email = ? WHERE id = ?",
[$data['name'], $data['email'], $userId]
);
// Invalidate cache
$this->redis->del("user:{$userId}");
// Optionally, warm the cache
return $this->find($userId);
}
}
This reduced our user lookup queries by 89%. We went from 8.4 million user lookups per day to 920,000.
But we discovered a problem: cache stampedes.
The Cache Stampede Problem We Hit
When a popular user's cache expired, we'd get 50-100 simultaneous requests for that user, all hitting the database at once. We saw this in our logs:
[2024-03-15 14:23:01] Cache miss for user:12847
[2024-03-15 14:23:01] Cache miss for user:12847
[2024-03-15 14:23:01] Cache miss for user:12847
[2024-03-15 14:23:01] Cache miss for user:12847
... (47 more times)
The solution was probabilistic early expiration:
public function find($userId)
{
$cacheKey = "user:{$userId}";
$ttl = 900; // 15 minutes
$cached = $this->redis->get($cacheKey);
if ($cached !== null) {
$data = json_decode($cached, true);
// Get remaining TTL
$remainingTtl = $this->redis->ttl($cacheKey);
// Probabilistic early expiration
// Higher chance of refresh as we approach expiration
$refreshProbability = (1 - ($remainingTtl / $ttl)) * 0.1;
if (mt_rand() / mt_getrandmax() < $refreshProbability) {
// Refresh in background
$this->refreshCacheAsync($userId);
}
return $data;
}
// Cache miss - use lock to prevent stampede
$lockKey = "lock:user:{$userId}";
$locked = $this->redis->set($lockKey, 1, ['NX', 'EX' => 10]);
if ($locked) {
// We got the lock - fetch and cache
$user = $this->fetchAndCache($userId);
$this->redis->del($lockKey);
return $user;
} else {
// Someone else is fetching - wait and retry
usleep(100000); // 100ms
return $this->find($userId);
}
}
This pattern eliminated cache stampedes completely. We never saw more than one database query per cache miss, even under heavy load.
Layer 2: Query Result Caching with Tags
For complex queries that joined multiple tables, we cached the entire result set and used tags for invalidation:
class AnalyticsRepository
{
public function getUserDashboardData($userId)
{
$cacheKey = "dashboard:{$userId}";
$tags = ["user:{$userId}", "dashboard"];
return $this->cache->tags($tags)->remember($cacheKey, 300, function() use ($userId) {
return $this->db->query("
SELECT
u.id, u.email, u.name,
COUNT(e.id) as event_count,
MAX(e.created_at) as last_event
FROM users u
LEFT JOIN events e ON u.id = e.user_id
WHERE u.id = ?
AND e.created_at >= NOW() - INTERVAL '30 days'
GROUP BY u.id, u.email, u.name
", [$userId])->fetch();
});
}
public function invalidateUserCache($userId)
{
// Invalidate all cache entries tagged with this user
$this->cache->tags(["user:{$userId}"])->flush();
}
}
We used Redis with tag support (via a custom implementation using sets):
class TaggedCache
{
private $redis;
public function tags(array $tags)
{
$this->currentTags = $tags;
return $this;
}
public function remember($key, $ttl, callable $callback)
{
$value = $this->redis->get($key);
if ($value !== null) {
return json_decode($value, true);
}
$value = $callback();
// Store the value
$this->redis->setex($key, $ttl, json_encode($value));
// Add to tag sets
foreach ($this->currentTags as $tag) {
$this->redis->sadd("tag:{$tag}", $key);
$this->redis->expire("tag:{$tag}", $ttl + 3600);
}
return $value;
}
public function flush()
{
foreach ($this->currentTags as $tag) {
// Get all keys with this tag
$keys = $this->redis->smembers("tag:{$tag}");
if (!empty($keys)) {
// Delete all tagged keys
$this->redis->del(...$keys);
}
// Delete the tag set
$this->redis->del("tag:{$tag}");
}
}
}
This gave us surgical cache invalidation. When a user created a new event, we could invalidate just their dashboard cache, not everyone's.
The Cache Invalidation Strategy That Actually Works
After trying several approaches, we settled on a hybrid strategy:
- Write-through for critical data: Update both database and cache on writes
- TTL-based expiration for everything else: Let data expire naturally
- Event-based invalidation for related data: When X changes, invalidate Y
Here's our event-based invalidation:
class EventCreatedListener
{
public function handle(EventCreated $event)
{
$userId = $event->event->user_id;
// Invalidate user's dashboard cache
Cache::tags(["user:{$userId}", "dashboard"])->flush();
// Invalidate user's event list cache
Cache::tags(["user:{$userId}", "events"])->flush();
// If this affects company-wide stats, invalidate those too
if ($event->event->affects_company_stats) {
$companyId = $event->event->company_id;
Cache::tags(["company:{$companyId}", "stats"])->flush();
}
}
}
This reduced our cache invalidation bugs by 90%. We went from "cache is always stale" to "cache is fresh when it matters."
Layer 3: HTTP Response Caching with Varnish
For public pages and API endpoints, we added Varnish in front of our application servers. This was the most impactful change.
Here's our Varnish configuration:
vcl 4.1;
backend default {
.host = "app-server";
.port = "8080";
}
sub vcl_recv {
# Don't cache authenticated requests
if (req.http.Authorization || req.http.Cookie ~ "session=") {
return (pass);
}
# Cache GET and HEAD requests only
if (req.method != "GET" && req.method != "HEAD") {
return (pass);
}
# Remove marketing cookies that don't affect content
set req.http.Cookie = regsuball(req.http.Cookie, "(^|;\s*)(_ga|_gid)=[^;]*", "");
return (hash);
}
sub vcl_backend_response {
# Cache for 60 seconds by default
if (beresp.ttl 0) {
set resp.http.X-Cache = "HIT";
} else {
set resp.http.X-Cache = "MISS";
}
}
We deployed three Varnish instances behind a load balancer. The results were incredible:
- Cache hit rate: 87%
- Application server requests reduced from 47M/day to 6.1M/day
- Average response time: 18ms (down from 180ms)
- Infrastructure costs: Down $43k/month (fewer app servers needed)
But we learned Varnish has gotchas.
The Varnish Gotchas We Discovered
Gotcha #1: Cookie Handling
Varnish won't cache responses if there are cookies in the request. We had Google Analytics cookies on every request, which meant nothing was being cached.
The solution was to strip non-essential cookies in vcl_recv (shown above).
Gotcha #2: Cache Invalidation
Varnish doesn't have built-in cache invalidation by tags. We had to implement PURGE requests:
acl purge {
"localhost";
"10.0.0.0"/8; # Internal network
}
sub vcl_recv {
if (req.method == "PURGE") {
if (!client.ip ~ purge) {
return (synth(405, "Not allowed"));
}
return (purge);
}
}
Then in our application:
public function invalidateVarnishCache($url)
{
$varnishServers = ['10.0.1.10', '10.0.1.11', '10.0.1.12'];
foreach ($varnishServers as $server) {
$ch = curl_init("http://{$server}:6081{$url}");
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "PURGE");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);
}
}
Gotcha #3: Grace Mode Complexity
Varnish's grace mode serves stale content while fetching fresh content in the background. Sounds great, but it can serve very stale content if your backend is down.
Unlock Premium Content
You've read 30% of this article
What's in the full article
- Complete step-by-step implementation guide
- Working code examples you can copy-paste
- Advanced techniques and pro tips
- Common mistakes to avoid
- Real-world examples and metrics
Don't have an account? Start your free trial
Join 10,000+ developers who love our premium content
Keep reading
Building a Modern SaaS Application with Laravel - Part 1: Multi-Tenancy Architecture & Database Foundations
58 min · 197 views
PerformanceDatabase Optimization Techniques for High-Traffic Websites
24 min · 78 views
Comprehensive TutorialsBuilding a Modern SaaS Application with Laravel - Part 3: Production Scaling, Queues & Observability
61 min · 69 views
Daniel Hartwell
AuthorCovers backend systems, distributed architecture, and database performance. Contributing author at NextGenBeing.
Never Miss an Article
Get our best content delivered to your inbox weekly. No spam, unsubscribe anytime.
Comments (0)
Please log in to leave a comment.
Log InRelated Articles
Building a Modern SaaS Application with Laravel - Part 1: Multi-Tenancy Architecture & Database Foundations
Apr 25, 2026
Building a Modern SaaS Application with Laravel - Part 3: Production Scaling, Queues & Observability
Apr 25, 2026
Building a Production-Grade E-Commerce Platform with Laravel 12 - Part 1: Domain-Driven Architecture & Project Setup
May 12, 2026