Last year, our team hit a wall at 5 million requests per day. Our Laravel application was running on a single EC2 instance with a modest RDS PostgreSQL database, and everything seemed fine—until it wasn't. One morning at 6:47 AM, I got the dreaded PagerDuty alert: response times had spiked to 8+ seconds, and our error rate was climbing past 15%. Users were complaining, our CEO was in my Slack DMs, and I was frantically SSHing into our production server in my pajamas.
Fast forward eighteen months, and we're now handling over 100 million requests per day with an average response time of 180ms and a 99.9% uptime. We didn't get there by following some magical tutorial or implementing a single silver bullet solution. We got there through systematic optimization, painful debugging sessions, architectural rewrites, and learning from our mistakes—lots of mistakes.
This isn't a theoretical guide. This is what actually worked for us, what failed spectacularly, and what I'd do differently if I had to start over. I'm going to show you the exact architecture we built, the specific AWS services we used, the database optimization techniques that made the biggest impact, and the caching strategies that saved us thousands of dollars per month. More importantly, I'll share the gotchas that nearly took us down and the monitoring setup that keeps us sleeping at night.
The Breaking Point: When Our Architecture Failed
Before I dive into solutions, let me paint a picture of where we started. Our initial architecture was what I call "optimistic simplicity"—a single t3.large EC2 instance running Nginx and PHP-FPM, connected to a db.t3.medium RDS PostgreSQL instance. We had Laravel's built-in file cache enabled, no queue workers, and we were processing everything synchronously. For our first year with under 1 million daily requests, this setup cost us about $280/month and worked perfectly fine.
The cracks started showing around 3 million requests per day. Our database CPU would spike to 80% during peak hours. Our application server would occasionally hit 90% memory usage. But we kept throwing band-aids at it—increasing PHP-FPM workers from 20 to 40, bumping the EC2 instance to t3.xlarge, upgrading the RDS instance to db.t3.large. Each fix bought us a few more weeks of runway.
Then we launched a major feature that went semi-viral on Product Hunt. Within 48 hours, our traffic jumped from 3M to 8M requests per day. Our single-server architecture completely collapsed. Here's what the actual error logs looked like:
[2024-03-15 06:47:23] production.ERROR: SQLSTATE[08006] [7]
FATAL: sorry, too many clients already
[2024-03-15 06:47:24] production.ERROR: SQLSTATE[08006] [7]
could not connect to server: Connection refused
[2024-03-15 06:47:25] production.ERROR: Maximum execution time of 30 seconds exceeded
Our database connection pool was maxed out at 100 connections, and every new request was getting rejected. PHP-FPM processes were hanging, waiting for database connections that would never come. The application was essentially dead in the water.
I spent that entire weekend rebuilding our infrastructure. My coworker Sarah and I were on a Zoom call from Saturday morning until Sunday night, deploying changes, monitoring metrics, rolling back when things broke, and slowly piecing together a scalable architecture. Here's what we learned.
The Architecture That Got Us to 100M Requests
Our current architecture is built on AWS and looks nothing like where we started. Here's the high-level overview:
Application Layer:
- Auto Scaling Group with 8-12 c6i.2xlarge EC2 instances (8 vCPU, 16GB RAM each)
- Application Load Balancer distributing traffic
- Laravel Octane with Swoole (persistent application state)
- PHP 8.3 with OPcache and JIT enabled
Database Layer:
- Aurora PostgreSQL Serverless v2 (2-16 ACUs, auto-scaling)
- 3 read replicas for query distribution
- PgBouncer connection pooler (transaction mode)
- RDS Proxy for connection management
Caching Layer:
- ElastiCache Redis cluster (3 nodes, r6g.xlarge)
- CloudFront CDN for static assets and API responses
- Application-level caching with Laravel's Redis driver
- Database query result caching with 5-minute TTL (a quick sketch of this follows the list below)
Queue System:
- SQS for job queuing
- 4 dedicated c6i.xlarge workers running Laravel Horizon
- Separate queues for critical, default, and low-priority jobs
Monitoring & Logging:
- CloudWatch for metrics and alarms
- X-Ray for distributed tracing
- Laravel Telescope in production (with heavy sampling)
- Custom Datadog dashboards for business metrics
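The query-result caching item above is the easiest piece to show in code. Here's a minimal sketch of a 5-minute-TTL cache using Laravel's Redis cache driver; the controller, cache key, and query are illustrative examples rather than our actual production code:

// Illustrative sketch: cache a hot product query in Redis for 5 minutes (300 seconds).
// The controller, cache key, and query are hypothetical, not our production code.
use App\Models\Product;
use Illuminate\Support\Facades\Cache;

class ProductController extends Controller
{
    public function index(int $categoryId)
    {
        $products = Cache::remember("products:category:{$categoryId}", 300, function () use ($categoryId) {
            // Runs only on a cache miss; the result is serialized into Redis (CACHE_DRIVER=redis)
            return Product::query()
                ->where('category_id', $categoryId)
                ->where('active', true)
                ->latest()
                ->limit(50)
                ->get();
        });

        return response()->json($products);
    }
}

The trade-off is the obvious one: anything cached this way can be up to five minutes stale, which is fine for product listings and disastrous for anything transactional.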
The monthly cost for this infrastructure runs about $4,800. At roughly 100 million requests per day (around 3 billion per month), that works out to about $0.0016 per 1,000 requests. Our old architecture cost $280/month, but at the 5 million requests per day it was struggling to serve, that was roughly $0.0019 per 1,000 requests. In other words, we scaled traffic 20x while the bill grew only about 17x, so our per-request cost actually went down while performance improved dramatically.
Let me break down each layer and show you exactly how we implemented it, including the configuration files, the gotchas we hit, and the performance improvements we measured.
Database Optimization: From 5s Queries to 50ms
Our database was the first bottleneck we hit, and it remained our biggest challenge throughout the entire scaling process. The problem wasn't just the database server specs—it was how we were using it. Let me show you the specific optimizations that made the biggest impact.
Connection Pooling: The Game Changer
The single biggest improvement came from implementing proper connection pooling. Laravel's default database configuration creates a new connection for every request, which is fine at low scale but disastrous at high scale. Here's why: PostgreSQL connection overhead is around 1-2ms per connection, plus you're limited by max_connections (100 by default on smaller RDS instances).
At 5M requests per day with an average response time of 500ms, you need about 29 concurrent connections to handle the load (5M requests / 86400 seconds * 0.5s response time). Sounds manageable, right? But traffic isn't evenly distributed. During our peak hour (usually 2-3 PM EST), we see 15% of daily traffic, which means we need 104 concurrent connections just for that hour. We were constantly hitting the connection limit.
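To make that arithmetic reproducible, here is the back-of-the-envelope version (it's just Little's law: concurrency is requests per second times average response time; the numbers are the ones quoted above):

// Back-of-the-envelope concurrency estimate (Little's law: concurrency ≈ req/s × avg response time)
$dailyRequests  = 5_000_000;
$avgResponseSec = 0.5;
$peakHourShare  = 0.15; // ~15% of the day's traffic lands in the 2-3 PM EST peak hour

$avgPerSecond  = $dailyRequests / 86_400;                   // ≈ 58 req/s averaged over the day
$peakPerSecond = ($dailyRequests * $peakHourShare) / 3_600; // ≈ 208 req/s during the peak hour

echo round($avgPerSecond * $avgResponseSec) . PHP_EOL;  // ≈ 29 concurrent connections on average
echo round($peakPerSecond * $avgResponseSec) . PHP_EOL; // ≈ 104 concurrent connections at peak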
We implemented PgBouncer in transaction mode, which pools connections at the transaction level. Here's our actual configuration:
[databases]
laravel = host=our-aurora-cluster.cluster-xxx.us-east-1.rds.amazonaws.com port=5432 dbname=production
[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
max_db_connections = 50
max_user_connections = 50
The key insight here is pool_mode = transaction. This allows PgBouncer to reuse database connections between transactions, not just between requests. With this configuration, we can handle 10,000 concurrent application connections with only 50 actual database connections.
In Laravel, we updated our database configuration to point to PgBouncer instead of directly to RDS:
// config/database.php
'pgsql' => [
    'driver' => 'pgsql',
    'host' => env('DB_HOST', 'pgbouncer-internal-lb-xxx.us-east-1.elb.amazonaws.com'),
    'port' => env('DB_PORT', '6432'), // PgBouncer port
    'database' => env('DB_DATABASE', 'laravel'),
    'username' => env('DB_USERNAME', 'forge'),
    'password' => env('DB_PASSWORD', ''),
    'charset' => 'utf8',
    'prefix' => '',
    'prefix_indexes' => true,
    'search_path' => 'public',
    'sslmode' => 'prefer',
    'options' => [
        PDO::ATTR_TIMEOUT => 5, // Connection timeout
        PDO::ATTR_PERSISTENT => false, // Don't use persistent connections with PgBouncer
    ],
],
⚠️ Critical gotcha: Don't use Laravel's persistent connections (PDO::ATTR_PERSISTENT => true) when you're using PgBouncer. It creates a connection leak because Laravel will hold onto connections that PgBouncer thinks are available for reuse. We discovered this after our application slowly leaked connections over 48 hours until PgBouncer hit its limit and started rejecting new connections. Took us 6 hours of debugging to figure out the root cause.
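The leak would have been obvious much sooner if we had been watching server-side connection counts over time. Here's a minimal sketch of a scheduled check; the command name, the hard-coded limit of 50 (our max_db_connections from pgbouncer.ini), and the 80% threshold are illustrative choices, not a drop-in from our repo:

// Hypothetical artisan command: warn when server-side Postgres connections creep toward PgBouncer's limit.
use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

class CheckDbConnections extends Command
{
    protected $signature = 'db:check-connections';
    protected $description = 'Warn when the PostgreSQL connection count approaches the PgBouncer pool limit';

    public function handle(): int
    {
        // pg_stat_activity counts real server-side connections, i.e. what PgBouncer holds open against Postgres
        $current = (int) DB::selectOne(
            'SELECT count(*) AS total FROM pg_stat_activity WHERE datname = current_database()'
        )->total;

        $limit = 50; // matches max_db_connections in pgbouncer.ini

        if ($current > $limit * 0.8) {
            Log::warning("Postgres connections at {$current}/{$limit}, possible connection leak");
        }

        $this->info("Current connections: {$current}/{$limit}");

        return self::SUCCESS;
    }
}

Scheduled every minute, a slow 48-hour leak shows up as a steadily climbing line on a dashboard instead of a surprise at 6:47 AM.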
After implementing PgBouncer, our database connection errors dropped from 300+ per hour to zero. Database CPU utilization during peak hours dropped from 85% to 42%. This single change bought us the headroom to scale from 5M to 15M requests per day.
Query Optimization and Indexing
Connection pooling solved our connection limit problem, but we still had slow queries. I ran a query analysis on our production database and found some horrifying statistics:
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 10;
Output:
query | calls | mean_exec_time | max_exec_time
---------------------------------------------------------------+--------+----------------+--------------
SELECT * FROM users WHERE email = $1 | 847392 | 847.32 | 4821.44
SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at... | 423847 | 523.18 | 3244.92
SELECT * FROM products WHERE category_id = $1 AND active = ... | 328471 | 412.73 | 2893.11
Our most common query—looking up users by email—was taking an average of 847ms. That's insane for a simple lookup. The problem? No index on the email column. I know, I know—how did we ship to production without an email index? In our defense, we had an index on id and assumed email lookups would be rare. We were wrong.
Here are the indexes we added that made the biggest impact:
// database/migrations/2024_03_16_add_critical_indexes.php
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::table('users', function (Blueprint $table) {
            $table->index('email'); // Reduced lookup from 847ms to 12ms
            $table->index(['status', 'created_at']); // Composite index for filtered queries
        });

        Schema::table('orders', function (Blueprint $table) {
            $table->index('user_id'); // Reduced from 523ms to 8ms
            $table->index(['user_id', 'status', 'created_at']); // Covering index
            $table->index('created_at'); // For date range queries
        });

        Schema::table('products', function (Blueprint $table) {
            $table->index(['category_id', 'active']); // Composite for filtered queries
            $table->index('sku'); // Unique lookups
        });

        // Partial index for active products only (reduces index size by 60%)
        DB::statement('CREATE INDEX idx_products_active ON products (category_id) WHERE active = true');
    }
};
The partial index on products was particularly clever. We have 2.3M products in our database, but only 900K are active at any given time. By creating a partial index that only includes active products, we reduced the index size from 180MB to 72MB, which improved query performance and reduced memory pressure on the database server.
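Before trusting numbers like these, it's worth confirming the planner actually picks up a new index. A quick sanity check from `php artisan tinker` looks something like this (the email value is a placeholder, and `users_email_index` is simply Laravel's default name for the index created above):

// Run EXPLAIN ANALYZE through Laravel and check which plan the query uses
use Illuminate\Support\Facades\DB;

$plan = DB::select("EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'someone@example.com'");

foreach ($plan as $row) {
    // Expect "Index Scan using users_email_index on users" rather than "Seq Scan on users"
    echo $row->{'QUERY PLAN'} . PHP_EOL;
}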
After adding these indexes, our P95 query time dropped from 2.4 seconds to 180ms. Our database CPU utilization dropped another 15 percentage points. We were finally able to handle 25M requests per day without the database breaking a sweat.
Read Replicas and Query Distribution
Even with optimized queries and proper indexing, we were still seeing occasional CPU spikes on our primary database instance during traffic surges. The solution was read replicas, but implementing them correctly in Laravel requires some thought.
We set up 3 Aurora read replicas and configured Laravel to use them for read queries:
// config/database.php
'pgsql' => [
    'read' => [
        'host' => [
            env('DB_READ_HOST_1', 'aurora-replica-1.xxx.us-east-1.rds.amazonaws.com'),
            env('DB_READ_HOST_2', 'aurora-replica-2.xxx.us-east-1.rds.amazonaws.com'),
            env('DB_READ_HOST_3', 'aurora-replica-3.xxx.us-east-1.rds.amazonaws.com'),
        ],
    ],
    'write' => [
        // Assumed: writes keep using DB_HOST (the PgBouncer endpoint from the earlier config)
        'host' => [env('DB_HOST', 'pgbouncer-internal-lb-xxx.us-east-1.elb.amazonaws.com')],
    ],
    'sticky' => true, // after a write, the same request keeps reading from the writer
    // ... driver, credentials, and PDO options unchanged from the earlier pgsql config
],