Last October, our API started throwing 503s at 2am. We'd hit 12 million requests per day, and our PostgreSQL read replicas were maxed out at 95% CPU. My manager Sarah called me - "We need caching, and we need it yesterday." I'd used Redis casually before, mostly for session storage. But this was different. We needed something that could handle tens of millions of operations per day, with sub-millisecond latency, without breaking our AWS budget.
I spent the next three weeks deep in the trenches with both Redis and Memcached. We A/B tested them in production, ran benchmarks until 3am, and discovered things that aren't in any documentation. This isn't a theoretical comparison - this is what actually happened when we scaled from 12M to 50M requests per day.
Why We Couldn't Just "Add More Database Servers"
Before I dive into Redis vs Memcached, let me explain why we even needed caching. You might think, "Just scale your database horizontally." We tried that first. Here's what we learned the hard way.
Our application serves product catalog data for an e-commerce platform. We had a primary PostgreSQL instance and three read replicas. The problem wasn't write throughput - it was reads. Specifically, these queries:
SELECT p.*, c.name as category_name, b.name as brand_name
FROM products p
LEFT JOIN categories c ON p.category_id = c.id
LEFT JOIN brands b ON p.brand_id = b.id
WHERE p.status = 'active'
AND p.stock_quantity > 0
ORDER BY p.created_at DESC
LIMIT 50;
Even with proper indexes, this query took 45-80ms during peak hours. We were running it thousands of times per minute. At 12M requests per day, that's roughly 8,300 requests per minute on average - and noticeably more at peak. Even distributed across three read replicas, each replica was handling 2,700+ queries per minute just for this endpoint.
We added a fourth read replica. Cost went up by $850/month (db.r5.2xlarge). Performance improved by maybe 15%. Not enough. The real issue was that we were querying the database for data that rarely changed. Product information updates maybe a few times per hour, but we were hitting the database thousands of times per minute.
That's when I realized we needed a proper caching layer. The question was: Redis or Memcached?
The Architecture Decision That Kept Me Up at Night
I started researching both options. The internet is full of comparisons, but they're mostly theoretical. "Redis supports more data structures." "Memcached is simpler and faster." Great, but what does that mean when you're serving real traffic?
My colleague Jake had used Memcached at his previous company (a social media analytics platform). He swore by its simplicity and raw speed. "It's just a hash table," he said. "That's all you need for caching."
But I'd heard Redis was more versatile. Our CTO had mentioned we might need pub/sub for real-time features eventually. Redis supports that natively. It also has persistence options, which sounded appealing.
I decided to test both in a staging environment first, then run a controlled A/B test in production. Here's what I set up:
Test Environment:
- AWS ElastiCache for both Redis and Memcached
- Redis: cache.r6g.xlarge (4 vCPUs, 12.93 GB RAM)
- Memcached: cache.r6g.xlarge (4 vCPUs, 12.93 GB RAM)
- Same instance size for fair comparison
- Cost: $0.282/hour each (~$203/month)
Application Stack:
- Python 3.11 with Flask
- Flask-Caching library (supports both backends)
- Gunicorn with 8 workers
- NGINX as reverse proxy
Test Methodology: I used Apache Bench (ab) and Locust for load testing. I wanted to simulate real-world traffic patterns, not just synthetic benchmarks.
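The Locust scenario looked roughly like this - a simplified sketch, with illustrative endpoint paths and task weights rather than our exact test plan:

# locustfile.py - simplified sketch of the traffic simulation
# (paths and weights are illustrative, not our exact test plan)
import random
from locust import HttpUser, task, between

class CatalogUser(HttpUser):
    wait_time = between(0.05, 0.5)  # think time between requests

    @task(10)  # product detail requests dominate our real traffic
    def product_detail(self):
        self.client.get(f"/api/products/{random.randint(1, 100000)}")

    @task(3)
    def category_listing(self):
        self.client.get(f"/api/categories/{random.randint(1, 50)}/products")

Run with something like locust -f locustfile.py --headless -u 100 -r 10 --host https://staging.example.com to ramp up to 100 simulated users (the host is a placeholder).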
Round 1: Simple Key-Value Operations (Where Memcached Shines)
Let me start with the basics - simple GET/SET operations. This is where Memcached is supposed to excel. It's designed for exactly this use case.
Test Setup:
# Redis implementation
from flask_caching import Cache

redis_cache = Cache(config={
    'CACHE_TYPE': 'redis',
    'CACHE_REDIS_HOST': 'my-redis-cluster.cache.amazonaws.com',
    'CACHE_REDIS_PORT': 6379,
    'CACHE_REDIS_DB': 0,
    'CACHE_DEFAULT_TIMEOUT': 3600,
    'CACHE_KEY_PREFIX': 'prod_'
})

# Memcached implementation
memcached_cache = Cache(config={
    'CACHE_TYPE': 'memcached',
    'CACHE_MEMCACHED_SERVERS': ['my-memcached-cluster.cache.amazonaws.com:11211'],
    'CACHE_DEFAULT_TIMEOUT': 3600,
    'CACHE_KEY_PREFIX': 'prod_'
})

# The endpoint below reads from whichever backend the test arm uses
cache = redis_cache  # swapped to memcached_cache for the Memcached arm
@app.route('/api/products/<int:product_id>')
def get_product(product_id):
    cache_key = f'product_{product_id}'

    # Try cache first
    cached_data = cache.get(cache_key)
    if cached_data:
        return jsonify(cached_data), 200

    # Cache miss - query database
    product = db.session.query(Product).filter_by(id=product_id).first()
    if not product:
        return jsonify({'error': 'Not found'}), 404

    product_data = {
        'id': product.id,
        'name': product.name,
        'price': float(product.price),
        'stock': product.stock_quantity,
        'category': product.category.name,
        'brand': product.brand.name
    }

    # Store in cache
    cache.set(cache_key, product_data, timeout=3600)
    return jsonify(product_data), 200
I ran 100,000 requests with 100 concurrent connections:
ab -n 100000 -c 100 https://api.example.com/api/products/12345
Memcached Results:
Requests per second: 18,423.67 [#/sec] (mean)
Time per request: 5.428 [ms] (mean)
Time per request: 0.054 [ms] (mean, across all concurrent requests)
Transfer rate: 12,847.23 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 3
Processing: 1 5 2.1 5 28
Waiting: 1 5 2.1 5 28
Total: 1 5 2.1 5 28
Percentage of requests served within a certain time (ms)
50% 5
66% 6
75% 6
80% 7
90% 8
95% 9
98% 11
99% 13
100% 28 (longest request)
Redis Results:
Requests per second: 16,891.23 [#/sec] (mean)
Time per request: 5.920 [ms] (mean)
Time per request: 0.059 [ms] (mean, across all concurrent requests)
Transfer rate: 11,778.45 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.3 0 4
Processing: 1 6 2.4 5 32
Waiting: 1 6 2.4 5 32
Total: 1 6 2.4 6 32
Percentage of requests served within a certain time (ms)
50% 6
66% 7
75% 7
80% 8
90% 9
95% 11
98% 13
99% 15
100% 32 (longest request)
Memcached was about 9% faster for simple GET operations. Not a huge difference, but noticeable at scale. With 50M requests per day, that ~0.5ms per request adds up to roughly 25 million milliseconds (about seven hours) of cumulative processing time per day.
But here's what surprised me: the difference was most pronounced during peak load. When I pushed to 200 concurrent connections, Memcached maintained its performance better:
At 200 concurrent connections:
- Memcached: 17,234 req/sec
- Redis: 14,567 req/sec
Memcached's simpler architecture (no persistence layer, no complex data structures) meant less overhead. For pure key-value operations, Jake was right - it's faster.
Round 2: Complex Data Structures (Where Redis Dominates)
But then I tested something more realistic. Our product catalog doesn't just cache individual products. We cache:
- Product lists by category
- Search results
- User shopping carts
- Product recommendations
- Real-time inventory counts
This is where things got interesting. Let me show you a real scenario we encountered.
Use Case: Shopping Cart Management
With Memcached, a shopping cart is just a serialized object:
# Memcached approach - serialize everything
import pickle

def add_to_cart_memcached(user_id, product_id, quantity):
    cart_key = f'cart_{user_id}'

    # Get entire cart
    cart = memcached_cache.get(cart_key)
    if not cart:
        cart = {}
    else:
        cart = pickle.loads(cart)

    # Modify cart
    if product_id in cart:
        cart[product_id] += quantity
    else:
        cart[product_id] = quantity

    # Save entire cart back
    memcached_cache.set(cart_key, pickle.dumps(cart))
    return cart

def get_cart_item_count_memcached(user_id):
    cart_key = f'cart_{user_id}'
    cart = memcached_cache.get(cart_key)
    if not cart:
        return 0
    cart = pickle.loads(cart)
    return sum(cart.values())
Every operation requires this full fetch-modify-store cycle (which also creates a race condition - see the sketch after this list):
- Fetching the entire cart
- Deserializing it
- Modifying it
- Serializing it
- Storing it back
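And because that cycle isn't atomic, two concurrent requests can read the same cart and the later write silently clobbers the earlier one. Memcached's answer is CAS (check-and-set). Here's a minimal sketch using pymemcache (the client we eventually settled on); the retry count is an arbitrary choice:

# Sketch: avoiding lost updates with Memcached CAS.
# default_noreply=False so set/add/cas return real results.
import pickle
from pymemcache.client.base import Client

mc = Client('my-memcached-cluster.cache.amazonaws.com:11211',
            default_noreply=False)

def add_to_cart_cas(user_id, product_id, quantity, retries=5):
    cart_key = f'cart_{user_id}'
    for _ in range(retries):
        value, cas_token = mc.gets(cart_key)
        cart = pickle.loads(value) if value else {}
        cart[product_id] = cart.get(product_id, 0) + quantity
        if cas_token is None:
            # Key doesn't exist yet; add() succeeds only if still absent
            if mc.add(cart_key, pickle.dumps(cart)):
                return cart
        elif mc.cas(cart_key, pickle.dumps(cart), cas_token):
            return cart
        # Another request changed the cart in between - retry
    raise RuntimeError(f'cart update contention for {user_id}')

It works, but every retry still ships the whole cart over the wire - exactly the overhead Redis's hash operations avoid.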
With Redis, I can use native hash operations:
# Redis approach - atomic operations
import redis

redis_client = redis.Redis(
    host='my-redis-cluster.cache.amazonaws.com',
    port=6379,
    decode_responses=True
)

def add_to_cart_redis(user_id, product_id, quantity):
    cart_key = f'cart_{user_id}'

    # Atomic increment
    redis_client.hincrby(cart_key, product_id, quantity)

    # Set expiration
    redis_client.expire(cart_key, 86400)  # 24 hours

    # Return updated cart
    return redis_client.hgetall(cart_key)

def get_cart_item_count_redis(user_id):
    cart_key = f'cart_{user_id}'
    cart = redis_client.hgetall(cart_key)
    return sum(int(qty) for qty in cart.values())
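One refinement worth noting: hincrby, expire, and hgetall in add_to_cart_redis are three separate round trips. A pipeline batches them into one, and since redis-py pipelines are transactional by default, the group also executes atomically:

# Variant: batch the three commands into a single round trip
def add_to_cart_redis_pipelined(user_id, product_id, quantity):
    cart_key = f'cart_{user_id}'
    pipe = redis_client.pipeline()  # transaction=True by default
    pipe.hincrby(cart_key, product_id, quantity)
    pipe.expire(cart_key, 86400)
    pipe.hgetall(cart_key)
    _, _, cart = pipe.execute()
    return cart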
I benchmarked 10,000 cart operations (mix of adds, updates, and reads):
Memcached:
- Total time: 8.3 seconds
- Average per operation: 0.83ms
- Peak memory usage: 245 MB (due to serialization overhead)
Redis:
- Total time: 3.1 seconds
- Average per operation: 0.31ms
- Peak memory usage: 89 MB
Redis was 2.7x faster for cart operations. Why? No serialization overhead. Redis understands hashes natively. Operations are atomic. No need to fetch-modify-store.
But the real win came when I tested concurrent cart modifications. Imagine 100 users all adding items to their carts simultaneously:
# Stress test with concurrent modifications
import concurrent.futures
import time

def stress_test_cart_operations(cache_impl, num_users=100, operations_per_user=50):
    # add_to_cart / get_cart_item_count dispatch to the *_memcached
    # or *_redis implementations above based on cache_impl
    start = time.time()

    def user_session(user_id):
        for i in range(operations_per_user):
            product_id = f"prod_{i % 20}"
            add_to_cart(cache_impl, user_id, product_id, 1)
            if i % 10 == 0:
                get_cart_item_count(cache_impl, user_id)

    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
        futures = [executor.submit(user_session, f"user_{i}") for i in range(num_users)]
        concurrent.futures.wait(futures)

    elapsed = time.time() - start
    total_ops = num_users * operations_per_user
    return elapsed, total_ops / elapsed

# Results
memcached_time, memcached_ops = stress_test_cart_operations('memcached')
redis_time, redis_ops = stress_test_cart_operations('redis')
print(f"Memcached: {memcached_time:.2f}s, {memcached_ops:.0f} ops/sec")
print(f"Redis: {redis_time:.2f}s, {redis_ops:.0f} ops/sec")
Output:
Memcached: 24.67s, 202 ops/sec
Redis: 7.89s, 633 ops/sec
Redis handled concurrent modifications 3.1x better than Memcached. This is crucial for high-traffic applications where multiple operations happen simultaneously.
The Persistence Question That Changed Everything
Three weeks into our testing, something happened that made me really glad we'd chosen Redis for one of our use cases.
At 4:30am on a Tuesday, our Memcached cluster had a node failure. AWS ElastiCache automatically replaced it, but here's what happened: we lost all cached data on that node. Our cache hit rate dropped from 89% to 31% instantly. Database load spiked to 98% CPU. We had to throttle traffic for 20 minutes while the cache warmed up.
This is Memcached's design - it's a pure memory cache. When a node dies, data is gone. There's no persistence, no replication, no recovery.
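Our mitigation since then is a warm-up script we run against any freshly replaced node before it takes full traffic. A simplified sketch, where top_product_ids() is a hypothetical helper (ours pulls the most-viewed product IDs from an analytics table):

# Cache warm-up after a node replacement (simplified sketch).
# top_product_ids() is a hypothetical helper - ours reads the
# most-viewed product IDs from an analytics table.
def warm_cache(limit=50000):
    hot_ids = top_product_ids(limit)
    for product in db.session.query(Product).filter(Product.id.in_(hot_ids)):
        cache.set(f'product_{product.id}', {
            'id': product.id,
            'name': product.name,
            'price': float(product.price),
            'stock': product.stock_quantity,
        }, timeout=3600)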
Redis offers persistence options:
RDB (Redis Database Backup):
# redis.conf
save 900 1 # Save after 900 seconds if at least 1 key changed
save 300 10 # Save after 300 seconds if at least 10 keys changed
save 60 10000 # Save after 60 seconds if at least 10000 keys changed
dbfilename dump.rdb
dir /var/lib/redis
AOF (Append Only File):
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec # fsync every second (good balance)
I configured Redis with AOF persistence using appendfsync everysec. This means Redis writes every command to disk, but only fsyncs once per second. It's a good balance between durability and performance.
The performance impact? About 8-12% throughput reduction with AOF enabled. Here's what I measured:
Redis without persistence:
- 16,891 req/sec
- 0.059ms average latency
Redis with AOF (appendfsync everysec):
- 14,823 req/sec
- 0.067ms average latency
Is the performance hit worth it? Depends on your use case. For session data and shopping carts, absolutely. Losing a user's cart because a cache node died is a terrible user experience. For product catalog data that can be quickly regenerated from the database, maybe not.
With Memcached, you don't have this choice. It's always volatile.
Memory Efficiency: The Surprise Winner
I expected Redis to use more memory because of its richer data structures and persistence. I was wrong.
I loaded 1 million product records into both caches and measured memory usage:
Test Data:
# Each product record
{
'id': 12345,
'name': 'Sample Product Name',
'price': 29.99,
'stock': 100,
'category': 'Electronics',
'brand': 'BrandName',
'description': 'A 200-character product description...',
'attributes': {
'color': 'blue',
'size': 'medium',
'weight': '1.5kg'
}
}
Memcached Memory Usage:
# Check Memcached stats
echo "stats" | nc my-memcached-cluster.cache.amazonaws.com 11211
STAT bytes 3847293184
STAT limit_maxbytes 13421772800
Total: 3.58 GB for 1M records
Redis Memory Usage:
redis-cli -h my-redis-cluster.cache.amazonaws.com info memory
used_memory:2938472448
used_memory_human:2.74G
used_memory_peak:2938472448
used_memory_peak_human:2.74G
Total: 2.74 GB for 1M records
Redis used 23% less memory than Memcached for the same data. How?
Redis has better compression and more efficient storage for certain data types. When I stored the product attributes as a Redis hash instead of a serialized JSON string, memory usage dropped even further:
# Instead of storing as JSON string
cache.set(f'product_{id}', json.dumps(product_data))
# Store as Redis hash
redis_client.hset(f'product_{id}', mapping={
'name': product_data['name'],
'price': str(product_data['price']),
'stock': str(product_data['stock']),
# ... etc
})
With this approach, Redis memory usage dropped to 2.31 GB - a 35% reduction compared to Memcached.
But here's a gotcha: Redis memory usage can grow unpredictably if you're not careful with persistence. The AOF file can grow large over time. You need to configure AOF rewriting:
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
This tells Redis to rewrite the AOF file when it grows 100% larger than the last rewrite and is at least 64MB. Without this, we saw AOF files grow to 15GB+ over a week, even though actual data was only 3GB.
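We now also watch AOF size relative to the live dataset, since Redis exposes both through INFO. A minimal sketch of the check - the 4x threshold is our arbitrary choice, and alert() is a hypothetical stand-in for a paging hook:

# Sketch: flag AOF bloat. The 4x threshold is arbitrary;
# alert() is a hypothetical stand-in for our paging hook.
def check_aof_bloat(client):
    used = client.info('memory')['used_memory']
    persistence = client.info('persistence')
    if persistence.get('aof_enabled'):
        aof_size = persistence['aof_current_size']
        if aof_size > 4 * used:
            alert(f'AOF is {aof_size / used:.1f}x the dataset size')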
The Eviction Policy Nightmare
Both Redis and Memcached use LRU (Least Recently Used) eviction when memory is full, but they behave differently. This caused us a production incident I'll never forget.
We were caching API responses with a 1-hour TTL. Under normal load, everything worked fine. Then we had a traffic spike - a product went viral on social media. Traffic increased 8x in 30 minutes.
What happened with Memcached:
Memcached started evicting keys to make room for new data. But here's the problem: its LRU runs within each slab class, and it doesn't take TTLs into account. So it was evicting keys that were only 5 minutes old (with 55 minutes left on their TTL) to make room for new keys.
Our cache hit rate dropped from 87% to 43%. Database load spiked. We had to emergency-scale our database.
What happened with Redis:
Redis has multiple eviction policies. We were using volatile-lru, which only considers keys that have an expiration set and evicts the least recently used among them.
maxmemory 10gb
maxmemory-policy volatile-lru
During the same traffic spike, Redis maintained a 76% cache hit rate. It was smarter about what to evict.
Here are Redis's eviction policies:
- noeviction: Return errors when the memory limit is reached
- allkeys-lru: Evict any key using LRU
- volatile-lru: Evict keys with an expiration set, using LRU
- allkeys-random: Evict random keys
- volatile-random: Evict random keys with an expiration set
- volatile-ttl: Evict keys with an expiration set, preferring shorter TTLs
For our use case, volatile-lru was perfect. All our cached data had TTLs, and we wanted to keep frequently accessed data even during memory pressure.
With Memcached, you only get LRU across everything. No nuance.
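On self-managed Redis you can even switch policies at runtime with CONFIG SET (note that ElastiCache restricts the CONFIG command, so there you change the parameter group instead):

# Runtime policy switch on self-managed Redis.
# (ElastiCache blocks CONFIG; use a parameter group there.)
redis_client.config_set('maxmemory-policy', 'volatile-lru')
print(redis_client.config_get('maxmemory-policy'))
# {'maxmemory-policy': 'volatile-lru'}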
Real-World Production Architecture
After three months of testing, here's what we actually deployed:
Redis for:
- User sessions (need persistence, need expiration)
- Shopping carts (need atomic operations, need persistence)
- Real-time data (pub/sub, sorted sets for leaderboards)
- Rate limiting (atomic increments, TTLs)
Memcached for:
- Product catalog cache (pure speed, can regenerate quickly)
- API response cache (simple key-value, high throughput)
- Database query cache (temporary, high volume)
We're running:
- 2x Redis clusters (cache.r6g.xlarge) - $406/month
- 1x Memcached cluster (cache.r6g.2xlarge) - $407/month
- Total: ~$813/month
Redis Cluster Configuration:
# Production Redis config
maxmemory 10gb
maxmemory-policy volatile-lru
appendonly yes
appendfsync everysec
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# Replication
replicaof no one # This is the master
replica-read-only yes
min-replicas-to-write 1
min-replicas-max-lag 10
# Persistence
save 900 1
save 300 10
save 60 10000
Memcached Configuration:
# ElastiCache parameter group
maxmemory: 20gb
chunk_size: 48
max_item_size: 1mb
The Client Library Minefield
One thing that bit us hard: not all Redis/Memcached clients are created equal. We started with pylibmc for Memcached and redis-py for Redis. Both are popular, but we ran into issues.
Memcached Client Problems:
pylibmc is fast but has quirks:
import pylibmc

# Connection pooling is critical
mc = pylibmc.Client(
    ['my-memcached-cluster.cache.amazonaws.com:11211'],
    binary=True,
    behaviors={
        'tcp_nodelay': True,
        'ketama': True,  # Consistent hashing
        'no_block': True,
        'num_threads': 4
    }
)

# But it doesn't handle connection failures gracefully
try:
    mc.set('key', 'value')
except pylibmc.Error as e:
    # Connection died, but pylibmc doesn't auto-reconnect -
    # you need to recreate the client
    mc = pylibmc.Client([...])
We had to implement our own retry logic and connection pooling. After two weeks of debugging intermittent failures, we switched to pymemcache:
from pymemcache.client.base import PooledClient
from pymemcache import serde
# Much better connection handling
mc = PooledClient(
'my-memcached-cluster.cache.amazonaws.com',
max_pool_size=32,
connect_timeout=2.0,
timeout=1.0,
no_delay=True,
serde=serde.pickle_serde
)
pymemcache handles connection failures gracefully, has better pooling, and is actively maintained.
Redis Client Lessons:
redis-py is solid, but you need to configure connection pooling properly:
import redis
# Bad: creates new connection for every operation
r = redis.Redis(host='my-redis-cluster.cache.amazonaws.com', port=6379)
# Good: uses connection pool
pool = redis.ConnectionPool(
host='my-redis-cluster.cache.amazonaws.com',
port=6379,
max_connections=50,
socket_keepalive=True,
socket_connect_timeout=2,
socket_timeout=2,
retry_on_timeout=True,
health_check_interval=30
)
r = redis.Redis(connection_pool=pool)
We also discovered a gotcha with redis-py's decode_responses flag. We'd enabled it for convenience (strings instead of bytes), but it decodes every response as UTF-8, which blows up on binary data:
# With decode_responses=True, every response gets UTF-8 decoded
r = redis.Redis(host='...', decode_responses=True)
r.set('key', b'\x00\x01\x02')  # Binary data
value = r.get('key')  # Raises UnicodeDecodeError!

# Fix: leave decode_responses off (the default) for binary data
r = redis.Redis(host='...', decode_responses=False)
r.set('key', b'\x00\x01\x02')
value = r.get('key')  # Returns b'\x00\x01\x02'
Monitoring and Observability
You can't optimize what you don't measure. Here's what we monitor for both Redis and Memcached.
Redis Monitoring:
We use CloudWatch for AWS ElastiCache, but also export metrics to Prometheus:
from prometheus_client import Counter, Histogram, Gauge
import time

# Metrics
redis_ops = Counter('redis_operations_total', 'Total Redis operations', ['operation', 'status'])
redis_latency = Histogram('redis_operation_duration_seconds', 'Redis operation latency', ['operation'])
redis_connections = Gauge('redis_connections_active', 'Active Redis connections')

def monitored_redis_get(key):
    start = time.time()
    try:
        result = redis_client.get(key)
        redis_ops.labels(operation='get', status='success').inc()
        return result
    except Exception:
        redis_ops.labels(operation='get', status='error').inc()
        raise
    finally:
        redis_latency.labels(operation='get').observe(time.time() - start)
Key Redis Metrics:
# Get Redis stats
redis-cli -h my-redis-cluster.cache.amazonaws.com info
# Important metrics:
connected_clients:42
used_memory:2847293184
used_memory_peak:3012847392
instantaneous_ops_per_sec:8234
keyspace_hits:182374
keyspace_misses:23847
evicted_keys:0
expired_keys:14823
# Cache hit rate
hit_rate = keyspace_hits / (keyspace_hits + keyspace_misses)
# 182374 / (182374 + 23847) = 88.4%
We alert on the following (the hit-rate check behind the first threshold is sketched after this list):
- Hit rate < 75%
- Evicted keys > 1000/hour
- Used memory > 90% of max
- Connected clients > 80% of max
- Replication lag > 5 seconds
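Here's roughly what that hit-rate check looks like in our metrics worker - a minimal sketch, where send_alert is a hypothetical stand-in for the actual paging integration:

# Minimal sketch of the hit-rate alert. send_alert is a
# hypothetical stand-in for our paging integration.
def check_redis_hit_rate(client, threshold=0.75):
    stats = client.info('stats')
    hits = stats['keyspace_hits']
    misses = stats['keyspace_misses']
    total = hits + misses
    if total > 0 and hits / total < threshold:
        send_alert(f'Redis cache hit rate dropped to {hits / total:.1%}')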
Memcached Monitoring:
Memcached's stats are simpler but still valuable:
import pymemcache
from pymemcache.client.base import PooledClient
mc = PooledClient('my-memcached-cluster.cache.amazonaws.com')
# Get stats
stats = mc.stats()
# Key metrics
print(f"Get hits: {stats[b'get_hits']}")
print(f"Get misses: {stats[b'get_misses']}")
print(f"Evictions: {stats[b'evictions']}")
print(f"Bytes used: {stats[b'bytes']}")
print(f"Current connections: {stats[b'curr_connections']}")
# Calculate hit rate
hits = int(stats[b'get_hits'])
misses = int(stats[b'get_misses'])
hit_rate = hits / (hits + misses) if (hits + misses) > 0 else 0
print(f"Hit rate: {hit_rate:.2%}")
Output:
Get hits: 8234782
Get misses: 892341
Evictions: 12847
Bytes used: 3847293184
Current connections: 156
Hit rate: 90.22%
Cache Invalidation: The Hard Problem
Phil Karlton said there are only two hard things in Computer Science: cache invalidation and naming things. He was right.
We tried several cache invalidation strategies. Here's what worked and what didn't.
Strategy 1: TTL-Based (Simple but Wasteful)
# Set everything with a TTL
cache.set('product_12345', product_data, timeout=3600) # 1 hour
# Problem: Data might update every 5 minutes, but we're serving
# stale data for up to an hour
This is the simplest approach, but it's wasteful. We were serving stale product prices for up to an hour. Not acceptable for e-commerce.
Strategy 2: Write-Through Cache (Complex but Accurate)
def update_product(product_id, new_data):
    # Update database
    product = db.session.query(Product).get(product_id)
    product.name = new_data['name']
    product.price = new_data['price']
    db.session.commit()

    # Immediately update cache
    cache_key = f'product_{product_id}'
    cache.set(cache_key, {
        'id': product.id,
        'name': product.name,
        'price': float(product.price),
        # ...
    }, timeout=3600)
This works but requires changing every database write. We had 47 different places in our codebase that updated products. Retrofitting this was a nightmare.
Strategy 3: Event-Based Invalidation (Our Solution)
We implemented a pub/sub system using Redis:
# Publisher (runs after database updates)
def publish_cache_invalidation(entity_type, entity_id, **extra):
    redis_client.publish(
        'cache_invalidation',
        json.dumps({
            'type': entity_type,
            'id': entity_id,
            'timestamp': time.time(),
            **extra
        })
    )

# After any product update
def update_product(product_id, new_data):
    product = db.session.query(Product).get(product_id)
    product.name = new_data['name']
    product.price = new_data['price']
    db.session.commit()

    # Publish invalidation event (include category_id so the
    # subscriber can invalidate related caches too)
    publish_cache_invalidation('product', product_id,
                               category_id=product.category_id)
# Subscriber (runs in background worker)
def cache_invalidation_worker():
    pubsub = redis_client.pubsub()
    pubsub.subscribe('cache_invalidation')

    for message in pubsub.listen():
        if message['type'] == 'message':
            data = json.loads(message['data'])
            if data['type'] == 'product':
                # Delete from both caches
                cache_key = f"product_{data['id']}"
                redis_client.delete(cache_key)
                memcached_client.delete(cache_key)

                # Also invalidate related caches, using the
                # category_id carried in the event payload
                category_key = f"category_products_{data['category_id']}"
                redis_client.delete(category_key)
                memcached_client.delete(category_key)
This approach works with both Redis and Memcached, but it requires Redis for the pub/sub channel. Memcached doesn't support pub/sub.
Strategy 4: Cache Tags (Redis Only)
For complex invalidation scenarios, we use Redis sets as tags:
def cache_product_with_tags(product):
    cache_key = f'product_{product.id}'

    # Store product data
    redis_client.hset(cache_key, mapping={
        'name': product.name,
        'price': str(product.price),
        'category_id': str(product.category_id)
    })
    redis_client.expire(cache_key, 3600)

    # Add to tag sets
    redis_client.sadd(f'tag:category_{product.category_id}', cache_key)
    redis_client.sadd(f'tag:brand_{product.brand_id}', cache_key)

def invalidate_by_category(category_id):
    tag_key = f'tag:category_{category_id}'

    # Get all keys with this tag
    keys = redis_client.smembers(tag_key)

    # Delete them all
    if keys:
        redis_client.delete(*keys)

    # Delete tag set
    redis_client.delete(tag_key)
This is powerful but Redis-specific. Memcached can't do this.
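One caveat: SMEMBERS followed by DELETE isn't atomic, so a key tagged between the read and the delete can slip through. A short Lua script closes that gap by running both steps server-side as one operation. A sketch (in Redis Cluster, the tag set and its members would also need a shared hash tag to land in the same slot):

# Atomic tag invalidation via a Lua script (a sketch)
INVALIDATE_TAG = """
local keys = redis.call('SMEMBERS', KEYS[1])
for _, k in ipairs(keys) do redis.call('DEL', k) end
redis.call('DEL', KEYS[1])
return #keys
"""
invalidate_tag = redis_client.register_script(INVALIDATE_TAG)

def invalidate_by_category_atomic(category_id):
    return invalidate_tag(keys=[f'tag:category_{category_id}'])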
The Cost Analysis Nobody Talks About
Let's talk money. Everyone focuses on performance, but what about cost?
Our Monthly AWS ElastiCache Costs:
Redis (2x cache.r6g.xlarge):
- Instance cost: 2 × $0.282/hour × 730 hours = $411.72
- Data transfer: ~$15/month
- Backup storage (snapshots): $8/month
- Total: ~$435/month
Memcached (1x cache.r6g.2xlarge):
- Instance cost: $0.564/hour × 730 hours = $411.72
- Data transfer: ~$12/month
- No backup costs (no persistence)
- Total: ~$424/month
Similar costs, but here's what we saved:
Before caching:
- 4x PostgreSQL read replicas (db.r5.2xlarge): 4 × $0.96/hour × 730 = $2,803/month
- RDS data transfer: ~$80/month
- Total: ~$2,883/month
After caching:
- 2x PostgreSQL read replicas (db.r5.xlarge): 2 × $0.48/hour × 730 = $700/month
- Redis: $435/month
- Memcached: $424/month
- RDS data transfer: ~$25/month
- Total: ~$1,584/month
Savings: $1,299/month (~45% reduction)
But the real savings came from avoiding database scaling. Without caching, we would have needed to scale to db.r5.4xlarge instances at 50M requests/day. That would have cost $3.84/hour per instance - $5,606/month for two instances.
With caching, our database costs actually went down as traffic increased.
Performance Tuning: The Deep Cuts
Here are the performance optimizations that made the biggest difference:
1. Connection Pooling (30% Improvement)
# Before: creating a new connection for every operation
def get_product_bad(product_id):
    r = redis.Redis(host='...')  # New connection every time!
    return r.get(f'product_{product_id}')

# After: using a connection pool
pool = redis.ConnectionPool(
    host='my-redis-cluster.cache.amazonaws.com',
    port=6379,
    max_connections=50,
    socket_keepalive=True
)
redis_client = redis.Redis(connection_pool=pool)

def get_product_good(product_id):
    return redis_client.get(f'product_{product_id}')
This single change reduced our P95 latency from 8.2ms to 5.7ms - a 30% improvement.
2. Pipelining (2.5x Throughput)
When fetching multiple keys, use pipelining:
# Bad: multiple round trips
def get_products_bad(product_ids):
    products = []
    for pid in product_ids:
        product = redis_client.get(f'product_{pid}')
        if product:
            products.append(json.loads(product))
    return products

# Good: single round trip
def get_products_good(product_ids):
    pipe = redis_client.pipeline()
    for pid in product_ids:
        pipe.get(f'product_{pid}')
    results = pipe.execute()

    products = []
    for result in results:
        if result:
            products.append(json.loads(result))
    return products
For 100 products:
- Bad approach: 100 network round trips, ~80ms total
- Good approach: 1 network round trip, ~3ms total
That's roughly 27x faster for the batch itself; measured end to end, it worked out to about 2.5x higher throughput on endpoints that fetch many keys.
3. Serialization Format (40% Memory Reduction)
We tested different serialization formats:
import json
import pickle
import msgpack
product_data = {
'id': 12345,
'name': 'Sample Product',
'price': 29.99,
# ... more fields
}
# JSON
json_size = len(json.dumps(product_data))
# 487 bytes
# Pickle
pickle_size = len(pickle.dumps(product_data))
# 412 bytes (15% smaller)
# MessagePack
msgpack_size = len(msgpack.packb(product_data))
# 291 bytes (40% smaller!)
We switched to MessagePack for large objects. Memory usage dropped by 38%, and we could cache 60% more data in the same memory.
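The swap itself was mechanical - roughly the pair of helpers below. One hedge: unlike pickle, msgpack only handles plain types (dicts, lists, strings, numbers, bytes), which our cached payloads already were, and it needs a client created with decode_responses=False:

# Sketch of the msgpack-backed helpers. Assumes redis_client was
# created with decode_responses=False (bytes in, bytes out).
import msgpack

def cache_set_msgpack(key, value, timeout=3600):
    redis_client.setex(key, timeout, msgpack.packb(value))

def cache_get_msgpack(key):
    raw = redis_client.get(key)
    return msgpack.unpackb(raw, raw=False) if raw is not None else None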
4. Key Naming Strategy (Faster Lookups)
We standardized our key naming:
# Bad: Inconsistent naming
cache.set(f'prod_{id}', ...)
cache.set(f'product-{id}', ...)
cache.set(f'{id}_product', ...)
# Good: Consistent, hierarchical naming
cache.set(f'product:{id}', ...)
cache.set(f'product:{id}:reviews', ...)
cache.set(f'category:{cat_id}:products', ...)
This made invalidation easier and debugging faster. We could also use Redis's SCAN command to find related keys:
# Find all product keys
cursor = 0
product_keys = []
while True:
    cursor, keys = redis_client.scan(cursor, match='product:*', count=100)
    product_keys.extend(keys)
    if cursor == 0:
        break
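redis-py also wraps that cursor loop in a generator, which is what we actually use:

# Same scan, using redis-py's generator wrapper
product_keys = list(redis_client.scan_iter(match='product:*', count=100))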
5. Compression for Large Values (70% Size Reduction)
For values > 1KB, we added compression:
import zlib
import json

def cache_set_compressed(key, value, timeout=3600):
    json_data = json.dumps(value)
    if len(json_data) > 1024:  # Only compress if > 1KB
        compressed = zlib.compress(json_data.encode('utf-8'))
        redis_client.setex(f'{key}:compressed', timeout, compressed)
        redis_client.delete(key)  # drop any stale uncompressed copy
    else:
        redis_client.setex(key, timeout, json_data)
        redis_client.delete(f'{key}:compressed')  # drop any stale compressed copy

def cache_get_compressed(key):
    # Try compressed first
    compressed = redis_client.get(f'{key}:compressed')
    if compressed:
        json_data = zlib.decompress(compressed).decode('utf-8')
        return json.loads(json_data)

    # Fall back to uncompressed
    json_data = redis_client.get(key)
    if json_data:
        return json.loads(json_data)
    return None
For product descriptions (average 2KB), this reduced memory usage by 68%.
The Gotchas That Cost Us Days
Here are the painful lessons we learned:
1. Redis Maxmemory Policy Confusion
We set maxmemory-policy allkeys-lru thinking it would work for all use cases. Wrong. Under allkeys-lru, every key is fair game for eviction - including our rate limit counters. Under memory pressure, Redis started evicting them mid-window!
# Rate limiting code
def check_rate_limit(user_id):
    key = f'rate_limit:{user_id}'
    count = redis_client.incr(key)
    if count == 1:
        redis_client.expire(key, 60)  # 1 minute window
    return count <= 100  # Max 100 requests per minute

# Problem: under allkeys-lru, a full Redis can evict rate limit keys!
# Users could bypass rate limits during high load.
Fix: We separated rate limiting into a dedicated Redis instance with maxmemory-policy noeviction. This prevents eviction and returns errors when full, which we handle gracefully:
def check_rate_limit_safe(user_id):
    key = f'rate_limit:{user_id}'
    try:
        count = redis_client.incr(key)
        if count == 1:
            redis_client.expire(key, 60)
        return count <= 100
    except redis.exceptions.ResponseError as e:
        if 'OOM' in str(e):
            # Redis is full - fail closed
            return False
        raise
2. Memcached Silent Failures
Memcached silently drops values larger than 1MB by default. We were caching search results that occasionally exceeded this limit. No error, no warning - just cache misses.
# This silently fails if data > 1MB
memcached_client.set('search_results', large_data)
# Returns None, even though set() returned True!
result = memcached_client.get('search_results')
Fix: We added size checks and split large values:
def memcached_set_safe(key, value, timeout=3600):
    serialized = pickle.dumps(value)
    if len(serialized) > 900_000:  # 900KB threshold
        # Split into chunks
        chunks = [serialized[i:i + 900_000] for i in range(0, len(serialized), 900_000)]
        for i, chunk in enumerate(chunks):
            memcached_client.set(f'{key}:chunk:{i}', chunk, timeout)
        # Store metadata
        memcached_client.set(f'{key}:chunks', len(chunks), timeout)
    else:
        memcached_client.set(key, serialized, timeout)
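The matching read path reassembles the chunks - a sketch. Note that chunks can be evicted independently, so one missing chunk has to be treated as a full cache miss, which is the main weakness of this workaround:

def memcached_get_safe(key):
    num_chunks = memcached_client.get(f'{key}:chunks')
    if num_chunks is not None:
        chunks = []
        for i in range(int(num_chunks)):
            chunk = memcached_client.get(f'{key}:chunk:{i}')
            if chunk is None:
                return None  # partial eviction - treat as a miss
            chunks.append(chunk)
        return pickle.loads(b''.join(chunks))
    serialized = memcached_client.get(key)
    return pickle.loads(serialized) if serialized is not None else None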
3. Redis Cluster Hash Tags
When we moved to Redis Cluster for horizontal scaling, our multi-key operations broke:
# This fails in Redis Cluster if keys are on different nodes
pipe = redis_client.pipeline()
pipe.get('user:123:cart')
pipe.get('user:123:wishlist')
pipe.execute() # CROSSSLOT error!
Fix: Use hash tags to ensure related keys are on the same node:
# Force both keys to same slot using {user:123}
pipe = redis_client.pipeline()
pipe.get('user:{user:123}:cart')
pipe.get('user:{user:123}:wishlist')
pipe.execute() # Works!
The part in curly braces determines which slot the key goes to. All keys with the same hash tag go to the same node.
4. Cache Stampede
When a popular cache key expires, hundreds of requests simultaneously try to regenerate it. We saw this with our homepage product feed:
def get_homepage_products():
    products = cache.get('homepage_products')
    if products:
        return products

    # Cache miss - query database
    # Problem: 100 concurrent requests all hit this code path
    products = db.session.query(Product).filter_by(featured=True).all()
    cache.set('homepage_products', products, timeout=300)
    return products
During a cache expiration, we saw 200+ simultaneous database queries for the same data.
Fix: We implemented a cache stampede prevention using Redis locks:
import time

def get_homepage_products():
    products = cache.get('homepage_products')
    if products:
        return products

    # Try to acquire lock
    lock_key = 'lock:homepage_products'
    lock_acquired = redis_client.set(lock_key, '1', nx=True, ex=10)

    if lock_acquired:
        # We got the lock - regenerate cache
        try:
            products = db.session.query(Product).filter_by(featured=True).all()
            cache.set('homepage_products', products, timeout=300)
            return products
        finally:
            redis_client.delete(lock_key)
    else:
        # Someone else is regenerating - wait and retry
        time.sleep(0.1)
        products = cache.get('homepage_products')
        if products:
            return products
        # Still not ready - query database (fallback)
        return db.session.query(Product).filter_by(featured=True).all()
This reduced database load during cache regeneration by 95%.
When to Choose Redis vs Memcached
After six months in production, here's my honest recommendation:
Choose Memcached when:
- You need absolute maximum throughput for simple key-value operations
- Your data is purely ephemeral (OK to lose on restart)
- You're caching data that's easy to regenerate (database query results)
- You want the simplest possible setup
- You're operating at extreme scale (100M+ ops/sec) where every microsecond matters
Choose Redis when:
- You need data structures beyond key-value (hashes, sets, sorted sets, lists)
- You need persistence (sessions, carts, user state)
- You need pub/sub or messaging
- You need atomic operations (counters, rate limiting)
- You need better eviction policies
- You want better observability and debugging tools
- You might need advanced features later (streams, transactions, Lua scripts)
For most teams, I recommend Redis. The flexibility is worth the slight performance trade-off. You'll eventually need one of Redis's advanced features, and retrofitting is painful.
But if you're building a pure caching layer for database queries and you're absolutely sure you'll never need anything beyond key-value, Memcached is slightly faster and simpler.
Our Final Architecture
Here's what we run in production today:
Primary Redis Cluster (cache.r6g.xlarge):
- User sessions
- Shopping carts
- Rate limiting
- Real-time features (pub/sub)
- Configuration: AOF persistence, volatile-lru eviction
- ~12GB memory, 89% hit rate
Secondary Redis Cluster (cache.r6g.large):
- Product metadata (hashes)
- Search indexes (sorted sets)
- Analytics counters
- Configuration: RDB snapshots only, volatile-lru eviction
- ~6GB memory, 84% hit rate
Memcached Cluster (cache.r6g.2xlarge):
- Database query results
- API response cache
- Template fragments
- Configuration: LRU eviction, no persistence
- ~20GB memory, 91% hit rate
Performance at 50M requests/day:
- Average API response time: 87ms (down from 340ms)
- P95 response time: 210ms (down from 1,200ms)
- Database CPU usage: 45% (down from 98%)
- Cache hit rate: 88% overall
- Infrastructure cost: $1,584/month (down from $2,883/month)
We're handling 4x more traffic with 45% lower infrastructure costs. That's the power of proper caching.
What I'd Do Differently
Looking back, here's what I wish I'd known:
- Start with Redis everywhere. We wasted time managing two caching systems. The performance difference isn't significant enough to justify the operational complexity.
- Invest in monitoring from day one. We added detailed metrics after our first incident. Should have done it from the start.
- Design for cache invalidation upfront. We retrofitted event-based invalidation after launch. Building it in from the beginning would have saved weeks.
- Load test with realistic data. Our initial tests used small, uniform cache values. Real production data has high variance in size and access patterns.
- Plan for failure modes. What happens when the cache is down? When it's full? When a node fails? We learned this the hard way at 3am.
- Document your caching strategy. Six months later, new team members are confused about what goes where and why. Write it down.
The journey from 12M to 50M requests per day taught us more about caching than any tutorial ever could. Redis and Memcached are both excellent tools - the key is understanding when to use each one and how to use them properly.
If you're facing similar scaling challenges, I hope this helps you avoid some of the mistakes we made. And if you're still on the fence between Redis and Memcached, my advice is simple: start with Redis. You can always add Memcached later if you need that extra 10% performance. But you probably won't need to.