Last year, my team went through something most engineers dread: migrating a high-traffic SaaS platform across all three major cloud providers. Not because we wanted to—because our acquisition meant consolidating infrastructure from three different companies, each married to their own cloud vendor. We had AWS workloads handling 20M requests/day, Azure running our ML pipelines processing 500GB daily, and Google Cloud managing our Kubernetes clusters with 200+ microservices.
What started as a nightmare turned into the most educational experience of my career. I got to see how AWS, Google Cloud, and Azure actually perform under identical production loads, with real money on the line and actual users affected by every decision.
Here's what three years of multi-cloud hell taught me. This isn't theory or marketing fluff—these are battle scars.
The Setup: Why Our Comparison Actually Matters
Most cloud comparisons you'll read are either vendor marketing disguised as content or surface-level feature checklists written by someone who's never deployed to production. I'm not going to waste your time with "AWS has X services while Azure has Y."
Instead, I'm sharing what we learned running the same workloads across all three platforms. We're talking about:
- 50M+ API requests per day distributed across REST and GraphQL endpoints
- PostgreSQL databases ranging from 500GB to 2TB with complex query patterns
- Redis clusters handling 100k+ ops/sec for session management and caching
- Kubernetes clusters running 200+ microservices with auto-scaling
- ML training pipelines processing computer vision models on GPUs
- CDN and object storage serving 10TB+ of static assets monthly
- Real-time data streaming with Kafka/Pub-Sub handling 5M events/day
Our monthly cloud spend across all three providers hit $180k at peak. When you're burning through that much money, you notice the differences fast.
Compute: Where Performance Actually Diverges
Let's start with compute because that's where most of your money goes. Everyone focuses on pricing per hour, but that's the wrong metric. What matters is price-performance ratio and how fast you can actually provision and scale.
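To make that concrete, here's a minimal sketch of the metric we actually optimize for: cost per million requests served at sustained throughput. The hourly price below is an assumed us-east-1 on-demand rate for illustration, not a figure from our bill, and the throughput number comes from the load tests shown below.

# Sketch: rank instances by cost per million requests, not by hourly price.
# The hourly price is an assumption; verify against current AWS pricing.

def cost_per_million_requests(hourly_price_usd: float, sustained_rps: float) -> float:
    """Dollars to serve 1M requests at a sustained requests-per-second rate."""
    hours_needed = (1_000_000 / sustained_rps) / 3600
    return hourly_price_usd * hours_needed

# ~1,667 req/s is what a single c6i.2xlarge sustained in our load tests (below).
print(f"${cost_per_million_requests(0.34, 1667):.3f} per 1M requests")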
AWS EC2: The Default That's Hard to Beat
I'll be honest—AWS EC2 is boring, and that's exactly why it works. After three years, our AWS infrastructure rarely surprises us, which in production is exactly what you want.
We run a mix of compute-optimized (c6i.2xlarge) and memory-optimized (r6i.xlarge) instances for our API servers. Here's what actually matters:
Real performance numbers from our production load tests:
# Load test: 10k concurrent users, 1M requests over 10 minutes
# AWS c6i.2xlarge (8 vCPU, 16GB RAM)
$ hey -z 10m -c 10000 -q 1667 https://api-aws.ourapp.com/health
Summary:
Total: 600.0234 secs
Requests/sec: 1666.60
Response time histogram:
0.000 [1] |
0.050 [856432]|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.100 [98234] |■■■■
0.150 [32109] |■
0.200 [8934] |
0.250 [3021] |
0.300 [1189] |
Latency distribution:
50% in 0.0423 secs
95% in 0.0876 secs
99% in 0.1234 secs
Compare that to our initial tests on t3.medium instances (which AWS loves to recommend for "cost savings"):
# Same load test on t3.medium (2 vCPU, 4GB RAM)
Response time histogram:
0.000 [1] |
0.200 [234123]|■■■■■■■■■■■■■■■■■
0.400 [398234]|■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.600 [287123]|■■■■■■■■■■■■■■■■■■■■
0.800 [54321] |■■■■
1.000 [18234] |■
Latency distribution:
50% in 0.3821 secs ← 9x SLOWER
95% in 0.7234 secs
99% in 0.9876 secs
The lesson: T3 instances with burstable CPU are a trap for anything beyond development. We learned this the hard way when our t3.large instances hit CPU credit exhaustion during a product launch. Response times went from 50ms to 800ms in under 10 minutes. Our monitoring exploded with alerts, and we lost about $15k in failed checkouts before we emergency-scaled to c6i instances.
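If you do run T-series instances anywhere near production, watch the credit balance rather than CPU utilization. Here's a minimal sketch (not our actual alerting stack) that polls CloudWatch's CPUCreditBalance with boto3; the instance ID and the 75-credit threshold are placeholders you'd tune to your own burst profile.

# Poll CPUCreditBalance so you see credit exhaustion coming,
# instead of finding out from your p99 latencies.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def remaining_cpu_credits(instance_id: str) -> float:
    """Return the most recent CPUCreditBalance datapoint for a T-series instance."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else float("nan")

balance = remaining_cpu_credits("i-0123456789abcdef0")  # placeholder instance ID
if balance < 75:  # placeholder threshold
    print(f"WARNING: only {balance:.0f} CPU credits left, throttling is close")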
💡 Pro Tip: AWS's Compute Optimizer is actually useful here. After running for two weeks, it recommended we switch from c5.2xlarge to c6i.2xlarge, which gave us 15% better performance for the same price. The newer Graviton instances (c7g) are even better—we saw 20% cost savings with identical performance, but you need ARM-compatible Docker images.
Google Cloud Compute Engine: Surprisingly Good for Specific Workloads
Google Cloud surprised me. I expected it to be the underdog, but for certain workloads, it actually outperforms AWS.
We migrated our ML training pipeline to Google Cloud because of their GPU availability and pricing. Here's the real comparison:
Training a ResNet-50 model on 100k images:
AWS p3.2xlarge (1x V100 GPU):
$ python train.py --epochs 50 --batch-size 64
Epoch 1/50: 100%|████████| 1563/1563 [02:34

Google Cloud Functions
Our webhook handler is a plain HTTP function:

// HTTP entry point (the exported name here is illustrative)
exports.webhook = async (req, res) => {
  const start = Date.now();
  const result = await processWebhook(req.body);
  const duration = Date.now() - start;
  console.log(`Processed in ${duration}ms`);
  res.status(200).json({ success: true });
};
Cold start metrics (1GB memory):
First invocation (cold start): 487ms ← 2x slower than Lambda
Subsequent invocations (warm): 15-22ms
After 15 minutes idle: 412ms (cold start)
Cloud Functions costs:
- Invocations: 10M requests/month at $0.40/1M = $4.00
- Compute: 10M × 200ms × 1GB = 2M GB-seconds at $0.0000025 = $5.00
- Total: ~$9/month
74% cheaper than Lambda, but the slower cold starts matter for user-facing endpoints. We use Cloud Functions for background jobs where latency isn't critical.
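If you want to reproduce these numbers, you don't need anything fancy. The sketch below (with a placeholder URL) hits an idle endpoint once for the cold path and a few more times for the warm path. It measures end-to-end time including network round trip, so compare the gap rather than the absolute values.

# Rough cold-start probe: one call after the function has sat idle, then warm calls.
import time

import requests

URL = "https://REGION-PROJECT.cloudfunctions.net/webhook"  # placeholder endpoint

def timed_get(url: str) -> float:
    """Return wall-clock latency for one GET, in milliseconds."""
    start = time.perf_counter()
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return (time.perf_counter() - start) * 1000

cold = timed_get(URL)                            # first call after 15+ minutes idle
warm = sorted(timed_get(URL) for _ in range(5))  # immediate follow-up calls
print(f"cold: {cold:.0f}ms, warm median: {warm[2]:.0f}ms")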
Azure Functions: Enterprise-Focused
Azure Functions has the worst cold starts but the best integration with Azure services.
Cold start metrics:
First invocation (cold start): 623ms ← Slowest
Subsequent invocations (warm): 18-28ms
After 15 minutes idle: 534ms (cold start)
Azure Functions costs:
- Invocations: 10M requests/month at $0.20/1M = $2.00
- Compute: 10M × 200ms × 1GB = 2M GB-seconds at $0.000016 = $32.00
- Total: ~$34/month
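The arithmetic in both cost breakdowns is the same formula with different unit prices, which makes it easy to model before you commit. Here's a small sketch, ignoring free tiers just like the figures above, using the per-invocation and per-GB-second rates quoted in this section (verify current pricing yourself):

def serverless_monthly_cost(invocations: int, avg_ms: int, memory_gb: float,
                            per_million_invocations: float, per_gb_second: float) -> float:
    """Request charge plus compute charge, ignoring free tiers."""
    gb_seconds = invocations * (avg_ms / 1000) * memory_gb
    return (invocations / 1_000_000) * per_million_invocations + gb_seconds * per_gb_second

# Same workload as above: 10M invocations/month, 200ms average, 1GB memory
print(f"GCP:   ${serverless_monthly_cost(10_000_000, 200, 1.0, 0.40, 0.0000025):.2f}")
print(f"Azure: ${serverless_monthly_cost(10_000_000, 200, 1.0, 0.20, 0.000016):.2f}")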
Azure Functions shines when you're using Azure Event Grid, Service Bus, or Cosmos DB. The integrations are seamless. But for general-purpose serverless, Lambda is better.
Networking: Where Hidden Costs Explode
Networking costs destroyed our initial cloud budget. We learned the hard way that data transfer charges add up fast.
The $12k Surprise: Cross-Region Data Transfer
In our first month on AWS, we got a $12k bill for data transfer. Here's what happened:
We had our API servers in us-east-1 and our database in us-west-2 (don't ask why—legacy reasons). Every API request transferred data across regions:
API request → us-east-1 server → us-west-2 database → us-east-1 server → user
AWS cross-region data transfer: $0.02/GB
Our API servers transferred 600TB across regions that month:
600,000 GB × $0.02/GB = $12,000 in cross-region transfer charges for the month.
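For scale, here's that bill reconstructed from the numbers in this article. The per-request figure at the end is a derived average using the aggregate 50M requests/day from the workload list, not something we measured separately.

CROSS_REGION_RATE = 0.02              # $/GB between us-east-1 and us-west-2
MONTHLY_TRANSFER_GB = 600_000         # 600 TB for the month
REQUESTS_PER_MONTH = 50_000_000 * 30  # aggregate figure from the workload list

print(f"bill: ${MONTHLY_TRANSFER_GB * CROSS_REGION_RATE:,.0f}")
avg_kb_per_request = MONTHLY_TRANSFER_GB * 1_000_000 / REQUESTS_PER_MONTH
print(f"~{avg_kb_per_request:.0f} KB of cross-region traffic per request, both directions combined")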