Database Performance: Indexing & Caching Strategies That Scale

Last March, our platform hit a wall at 47 million queries per day. Our PostgreSQL database was melting down. Average response times had crept from 800ms to 4.2 seconds. Users were complaining. Our infrastructure costs had ballooned to $142k/month just for database instances. My CTO Sarah pulled me into a conference room and said, "We need to fix this in two weeks, or we're looking at a complete architecture rewrite."

I'd been at the company for three years, and I thought I knew our database pretty well. Turns out, I knew almost nothing about how it actually performed under real load. What followed was an intense two-week sprint that taught me more about database optimization than my previous five years of development combined.

Here's what we learned about indexing and caching when your back is against the wall and every millisecond counts.

The Performance Crisis Nobody Saw Coming

We'd grown from 10,000 to 500,000 active users in eight months. Our application was a SaaS analytics platform that processed customer data in real-time. Every user action triggered multiple database queries—sometimes 15-20 queries per page load.

The warning signs were there, but we'd ignored them. Our monitoring showed query times slowly climbing, but we kept saying "we'll optimize later." That's the mistake everyone makes. Later becomes never until it becomes a crisis.

When we finally dug into the data, the numbers were brutal:

47 million queries per day
Average query time: 4.2 seconds
95th percentile: 12.8 seconds
Peak hour queries timing out completely
Database CPU consistently above 85%
Connection pool exhausted during traffic spikes

Our PostgreSQL instance was a db.r5.4xlarge on AWS (16 vCPUs, 128GB RAM), and we were already discussing upgrading to a db.r5.8xlarge at $6,800/month. That's when I realized we were trying to solve a software problem with hardware.

What the Monitoring Data Actually Revealed

I spent the first two days just understanding what was happening. We used pgBadger to analyze our PostgreSQL logs, and the results were eye-opening.

The top 10 slowest queries accounted for 73% of our total database time. Let me show you the actual query that was killing us:

SELECT 
    u.id, u.email, u.name, u.created_at,
    p.plan_name, p.price,
    COUNT(e.id) as event_count,
    MAX(e.created_at) as last_event
FROM users u
LEFT JOIN subscriptions s ON u.id = s.user_id
LEFT JOIN plans p ON s.plan_id = p.id
LEFT JOIN events e ON u.id = e.user_id
WHERE u.company_id = $1
    AND u.status = 'active'
    AND e.created_at >= NOW() - INTERVAL '30 days'
GROUP BY u.id, u.email, u.name, u.created_at, p.plan_name, p.price
ORDER BY event_count DESC
LIMIT 50;

This query ran on our dashboard page. Every single time someone loaded their dashboard, this query executed. With 500k active users, that's a lot of dashboard loads.

Running EXPLAIN ANALYZE showed the horror:

Planning Time: 2.347 ms
Execution Time: 4,234.891 ms

Seq Scan on events e  (cost=0.00..892347.23 rows=1847293 width=16)
  Filter: (created_at >= (now() - '30 days'::interval))
  Rows Removed by Filter: 38472934

We were doing a sequential scan on a table with 40 million rows. Every. Single. Time.

The Indexing Strategy That Changed Everything

I'll be honest—I thought I understood indexes. I knew you put them on columns you query frequently. What I didn't understand was the nuance of composite indexes, index ordering, partial indexes, and how PostgreSQL actually uses them.

My colleague Jake, who'd worked at a high-frequency trading firm, sat down with me and explained what I was missing.

Composite Indexes: Order Matters More Than You Think

The first thing Jake pointed out was our index on the events table:

CREATE INDEX idx_events_user_id ON events(user_id);
CREATE INDEX idx_events_created_at ON events(created_at);

"You've got two separate indexes," he said. "PostgreSQL might use one, but it can't efficiently use both together for this query."

We needed a composite index, but the order mattered. Here's what we created:

CREATE INDEX idx_events_user_created 
ON events(user_id, created_at DESC) 
WHERE created_at >= NOW() - INTERVAL '90 days';

Notice three things:

Column order: user_id first, then created_at. We filter by user_id and then need to find recent events for that user.
DESC ordering: We're ordering by event_count DESC in the query, and this index helps with that.
Partial index: We only care about events from the last 90 days. Why index historical data we never query?

This one index reduced our query time from 4.2 seconds to 380ms. A 91% improvement from a single CREATE INDEX statement.

But we weren't done.

The Covering Index Trick Nobody Talks About

The query was still doing a lot of work. After using the index to find the right rows, PostgreSQL had to go back to the table to fetch the actual data (a "heap fetch"). With millions of rows, that's expensive.

Jake introduced me to covering indexes—indexes that include all the columns you need, so PostgreSQL never has to touch the main table:

CREATE INDEX idx_events_user_created_covering 
ON events(user_id, created_at DESC) 
INCLUDE (id)
WHERE created_at >= NOW() - INTERVAL '90 days';

The INCLUDE clause adds columns to the index without making them part of the index key. This means PostgreSQL can satisfy the entire query from the index alone.

Query time dropped to 180ms. We'd gone from 4.2 seconds to 180ms with strategic indexing.

The Index Maintenance Problem We Discovered Later

Three weeks after deploying our new indexes, we noticed something weird. Our write performance had degraded. INSERT and UPDATE operations on the events table were taking 40% longer.

This is the trade-off nobody mentions in the tutorials. Every index you add makes writes slower because PostgreSQL has to update the index on every INSERT, UPDATE, or DELETE.

We had created seven indexes on the events table, and each one added overhead. We ran this query to see the impact:

SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
    AND tablename = 'events'
ORDER BY idx_scan;

Output:

indexname                          | idx_scan | index_size
-----------------------------------+----------+-----------
idx_events_status                  |      147 | 892 MB
idx_events_type                    |      892 | 1.2 GB
idx_events_user_created_covering   |  8472934 | 3.4 GB

We had two indexes that were barely used but consuming 2.1GB and slowing down every write. We dropped them immediately.

Lesson learned: Monitor your indexes. Use pg_stat_user_indexes to find unused indexes and drop them ruthlessly.

Partial Indexes for Specific Query Patterns

One of our most common queries was finding active users who had logged in within the last 7 days:

SELECT * FROM users 
WHERE status = 'active' 
    AND last_login_at >= NOW() - INTERVAL '7 days';

We had 2.3 million total users, but only 480,000 were active. Creating a full index on status and last_login_at would index 1.8 million rows we never query.

Instead, we used a partial index:

CREATE INDEX idx_users_active_recent_login 
ON users(last_login_at DESC) 
WHERE status = 'active';

This index only includes active users, making it 76% smaller and much faster to scan. The WHERE clause in the index definition tells PostgreSQL to only index rows that match that condition.

Query time went from 340ms to 45ms, and the index was only 180MB instead of 750MB.

The Expression Index That Saved Our Search Feature

Our search feature was a disaster. Users could search for other users by email, and we had this query:

SELECT * FROM users 
WHERE LOWER(email) LIKE LOWER($1 || '%');

We had an index on email, but PostgreSQL couldn't use it because of the LOWER() function. Sequential scans on 2.3 million rows meant 2-3 second search times.

The solution was an expression index:

CREATE INDEX idx_users_email_lower ON users(LOWER(email) text_pattern_ops);

Two things here:

Expression index: We index the result of LOWER(email), not the raw column
text_pattern_ops: This operator class is optimized for LIKE queries with a prefix pattern

Search queries dropped to 12ms. Users were happy again.

When Indexes Aren't Enough: Enter Caching

Even with perfect indexes, we were still hitting the database 47 million times per day. Each query was fast, but the sheer volume was keeping our database CPU at 60-70%.

We needed caching, but I'd tried caching before and it had been a nightmare. Cache invalidation is famously one of the two hard problems in computer science (along with naming things and off-by-one errors).

The Multi-Layer Caching Strategy We Implemented

We implemented three caching layers, each with a specific purpose:

Layer 1: Application-Level Object Cache (Redis)

Cached complete objects (users, plans, etc.)
TTL: 5-15 minutes depending on data volatility
Invalidated on writes

Layer 2: Query Result Cache (Redis)

Cached expensive query results
TTL: 1-5 minutes
Tagged for bulk invalidation

Layer 3: HTTP Response Cache (Varnish)

Cached complete HTTP responses
TTL: 30-60 seconds
Edge-side includes for personalized content

Let me show you how we implemented each layer and what we learned.

Layer 1: Object Caching with Redis

For frequently accessed objects that rarely change, we used a simple cache-aside pattern:

class UserRepository
{
    private $redis;
    private $db;
    
    public function find($userId)
    {
        $cacheKey = "user:{$userId}";
        
        // Try cache first
        $cached = $this->redis->get($cacheKey);
        if ($cached !== null) {
            return json_decode($cached, true);
        }
        
        // Cache miss - fetch from database
        $user = $this->db->query(
            "SELECT * FROM users WHERE id = ?", 
            [$userId]
        )->fetch();
        
        if ($user) {
            // Store in cache for 15 minutes
            $this->redis->setex(
                $cacheKey, 
                900, 
                json_encode($user)
            );
        }
        
        return $user;
    }
    
    public function update($userId, $data)
    {
        // Update database
        $this->db->query(
            "UPDATE users SET name = ?, email = ? WHERE id = ?",
            [$data['name'], $data['email'], $userId]
        );
        
        // Invalidate cache
        $this->redis->del("user:{$userId}");
        
        // Optionally, warm the cache
        return $this->find($userId);
    }
}

This reduced our user lookup queries by 89%. We went from 8.4 million user lookups per day to 920,000.

But we discovered a problem: cache stampedes.

The Cache Stampede Problem We Hit

When a popular user's cache expired, we'd get 50-100 simultaneous requests for that user, all hitting the database at once. We saw this in our logs:

[2024-03-15 14:23:01] Cache miss for user:12847
[2024-03-15 14:23:01] Cache miss for user:12847
[2024-03-15 14:23:01] Cache miss for user:12847
[2024-03-15 14:23:01] Cache miss for user:12847
... (47 more times)

The solution was probabilistic early expiration:

public function find($userId)
{
    $cacheKey = "user:{$userId}";
    $ttl = 900; // 15 minutes
    
    $cached = $this->redis->get($cacheKey);
    if ($cached !== null) {
        $data = json_decode($cached, true);
        
        // Get remaining TTL
        $remainingTtl = $this->redis->ttl($cacheKey);
        
        // Probabilistic early expiration
        // Higher chance of refresh as we approach expiration
        $refreshProbability = (1 - ($remainingTtl / $ttl)) * 0.1;
        
        if (mt_rand() / mt_getrandmax() < $refreshProbability) {
            // Refresh in background
            $this->refreshCacheAsync($userId);
        }
        
        return $data;
    }
    
    // Cache miss - use lock to prevent stampede
    $lockKey = "lock:user:{$userId}";
    $locked = $this->redis->set($lockKey, 1, ['NX', 'EX' => 10]);
    
    if ($locked) {
        // We got the lock - fetch and cache
        $user = $this->fetchAndCache($userId);
        $this->redis->del($lockKey);
        return $user;
    } else {
        // Someone else is fetching - wait and retry
        usleep(100000); // 100ms
        return $this->find($userId);
    }
}

This pattern eliminated cache stampedes completely. We never saw more than one database query per cache miss, even under heavy load.

Layer 2: Query Result Caching with Tags

For complex queries that joined multiple tables, we cached the entire result set and used tags for invalidation:

class AnalyticsRepository
{
    public function getUserDashboardData($userId)
    {
        $cacheKey = "dashboard:{$userId}";
        $tags = ["user:{$userId}", "dashboard"];
        
        return $this->cache->tags($tags)->remember($cacheKey, 300, function() use ($userId) {
            return $this->db->query("
                SELECT 
                    u.id, u.email, u.name,
                    COUNT(e.id) as event_count,
                    MAX(e.created_at) as last_event
                FROM users u
                LEFT JOIN events e ON u.id = e.user_id
                WHERE u.id = ?
                    AND e.created_at >= NOW() - INTERVAL '30 days'
                GROUP BY u.id, u.email, u.name
            ", [$userId])->fetch();
        });
    }
    
    public function invalidateUserCache($userId)
    {
        // Invalidate all cache entries tagged with this user
        $this->cache->tags(["user:{$userId}"])->flush();
    }
}

We used Redis with tag support (via a custom implementation using sets):

class TaggedCache
{
    private $redis;
    
    public function tags(array $tags)
    {
        $this->currentTags = $tags;
        return $this;
    }
    
    public function remember($key, $ttl, callable $callback)
    {
        $value = $this->redis->get($key);
        
        if ($value !== null) {
            return json_decode($value, true);
        }
        
        $value = $callback();
        
        // Store the value
        $this->redis->setex($key, $ttl, json_encode($value));
        
        // Add to tag sets
        foreach ($this->currentTags as $tag) {
            $this->redis->sadd("tag:{$tag}", $key);
            $this->redis->expire("tag:{$tag}", $ttl + 3600);
        }
        
        return $value;
    }
    
    public function flush()
    {
        foreach ($this->currentTags as $tag) {
            // Get all keys with this tag
            $keys = $this->redis->smembers("tag:{$tag}");
            
            if (!empty($keys)) {
                // Delete all tagged keys
                $this->redis->del(...$keys);
            }
            
            // Delete the tag set
            $this->redis->del("tag:{$tag}");
        }
    }
}

This gave us surgical cache invalidation. When a user created a new event, we could invalidate just their dashboard cache, not everyone's.

The Cache Invalidation Strategy That Actually Works

After trying several approaches, we settled on a hybrid strategy:

Write-through for critical data: Update both database and cache on writes
TTL-based expiration for everything else: Let data expire naturally
Event-based invalidation for related data: When X changes, invalidate Y

Here's our event-based invalidation:

class EventCreatedListener
{
    public function handle(EventCreated $event)
    {
        $userId = $event->event->user_id;
        
        // Invalidate user's dashboard cache
        Cache::tags(["user:{$userId}", "dashboard"])->flush();
        
        // Invalidate user's event list cache
        Cache::tags(["user:{$userId}", "events"])->flush();
        
        // If this affects company-wide stats, invalidate those too
        if ($event->event->affects_company_stats) {
            $companyId = $event->event->company_id;
            Cache::tags(["company:{$companyId}", "stats"])->flush();
        }
    }
}

This reduced our cache invalidation bugs by 90%. We went from "cache is always stale" to "cache is fresh when it matters."

Layer 3: HTTP Response Caching with Varnish

For public pages and API endpoints, we added Varnish in front of our application servers. This was the most impactful change.

Here's our Varnish configuration:

vcl 4.1;

backend default {
    .host = "app-server";
    .port = "8080";
}

sub vcl_recv {
    # Don't cache authenticated requests
    if (req.http.Authorization || req.http.Cookie ~ "session=") {
        return (pass);
    }
    
    # Cache GET and HEAD requests only
    if (req.method != "GET" && req.method != "HEAD") {
        return (pass);
    }
    
    # Remove marketing cookies that don't affect content
    set req.http.Cookie = regsuball(req.http.Cookie, "(^|;\s*)(_ga|_gid)=[^;]*", "");
    
    return (hash);
}

sub vcl_backend_response {
    # Cache for 60 seconds by default
    if (beresp.ttl  0) {
        set resp.http.X-Cache = "HIT";
    } else {
        set resp.http.X-Cache = "MISS";
    }
}

We deployed three Varnish instances behind a load balancer. The results were incredible:

Cache hit rate: 87%
Application server requests reduced from 47M/day to 6.1M/day
Average response time: 18ms (down from 180ms)
Infrastructure costs: Down $43k/month (fewer app servers needed)

But we learned Varnish has gotchas.

The Varnish Gotchas We Discovered

Gotcha #1: Cookie Handling

Varnish won't cache responses if there are cookies in the request. We had Google Analytics cookies on every request, which meant nothing was being cached.

The solution was to strip non-essential cookies in vcl_recv (shown above).

Gotcha #2: Cache Invalidation

Varnish doesn't have built-in cache invalidation by tags. We had to implement PURGE requests:

acl purge {
    "localhost";
    "10.0.0.0"/8;  # Internal network
}

sub vcl_recv {
    if (req.method == "PURGE") {
        if (!client.ip ~ purge) {
            return (synth(405, "Not allowed"));
        }
        return (purge);
    }
}

Then in our application:

public function invalidateVarnishCache($url)
{
    $varnishServers = ['10.0.1.10', '10.0.1.11', '10.0.1.12'];
    
    foreach ($varnishServers as $server) {
        $ch = curl_init("http://{$server}:6081{$url}");
        curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "PURGE");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        curl_close($ch);
    }
}

Gotcha #3: Grace Mode Complexity

Varnish's grace mode serves stale content while fetching fresh content in the background. Sounds great, but it can serve very stale content if your backend is down.

Unlock Premium Content

You've read 30% of this article

What's in the full article

Complete step-by-step implementation guide
Working code examples you can copy-paste
Advanced techniques and pro tips
Common mistakes to avoid
Real-world examples and metrics

Don't have an account? Start your free trial

Join 10,000+ developers who love our premium content

Articles

Tutorials

Bloggers

Optimizing Database Performance with Indexing and Caching: What We Learned Scaling to 100M Queries/Day

Listen to Article

The Performance Crisis Nobody Saw Coming

What the Monitoring Data Actually Revealed

The Indexing Strategy That Changed Everything

Composite Indexes: Order Matters More Than You Think

The Covering Index Trick Nobody Talks About

The Index Maintenance Problem We Discovered Later

Partial Indexes for Specific Query Patterns

The Expression Index That Saved Our Search Feature

When Indexes Aren't Enough: Enter Caching

The Multi-Layer Caching Strategy We Implemented

Layer 1: Object Caching with Redis

The Cache Stampede Problem We Hit

Layer 2: Query Result Caching with Tags

The Cache Invalidation Strategy That Actually Works

Layer 3: HTTP Response Caching with Varnish

The Varnish Gotchas We Discovered

Unlock Premium Content

What's in the full article

Keep reading

Building a Modern SaaS Application with Laravel - Part 1: Multi-Tenancy Architecture & Database Foundations

React Server Components: When to Use vs Client Components

Building a Production-Grade E-Commerce Platform with Laravel 12 - Part 1: Domain-Driven Architecture & Project Setup

Bekzod Erkinov

Get the AI-Assisted Developer's Field Guide

Comments (0)

Related Articles

Building a Modern SaaS Application with Laravel - Part 1: Multi-Tenancy Architecture & Database Foundations

Building a Modern SaaS Application with Laravel - Part 3: Production Scaling, Queues & Observability

Building a Production-Grade E-Commerce Platform with Laravel 12 - Part 2: Product Catalog & Cart with Database Design

Before you go…

Articles

Tutorials

Bloggers

Optimizing Database Performance with Indexing and Caching: What We Learned Scaling to 100M Queries/Day

Listen to Article

The Performance Crisis Nobody Saw Coming

What the Monitoring Data Actually Revealed

The Indexing Strategy That Changed Everything

Composite Indexes: Order Matters More Than You Think

The Covering Index Trick Nobody Talks About

The Index Maintenance Problem We Discovered Later

Partial Indexes for Specific Query Patterns

The Expression Index That Saved Our Search Feature

When Indexes Aren't Enough: Enter Caching

The Multi-Layer Caching Strategy We Implemented

Layer 1: Object Caching with Redis

The Cache Stampede Problem We Hit

Layer 2: Query Result Caching with Tags

The Cache Invalidation Strategy That Actually Works

Layer 3: HTTP Response Caching with Varnish

The Varnish Gotchas We Discovered

Unlock Premium Content

What's in the full article

Keep reading

Building a Modern SaaS Application with Laravel - Part 1: Multi-Tenancy Architecture & Database Foundations

React Server Components: When to Use vs Client Components

Building a Production-Grade E-Commerce Platform with Laravel 12 - Part 1: Domain-Driven Architecture & Project Setup

Bekzod Erkinov

Get the AI-Assisted Developer's Field Guide

Comments (0)

Related Articles

Building a Modern SaaS Application with Laravel - Part 1: Multi-Tenancy Architecture & Database Foundations

Building a Modern SaaS Application with Laravel - Part 3: Production Scaling, Queues & Observability

Building a Production-Grade E-Commerce Platform with Laravel 12 - Part 2: Product Catalog & Cart with Database Design

Don't miss the next deep dive

Cookie & Ad Consent