Daniel Hartwell
Building a Production-Grade E-Commerce Platform with Laravel 12, Stripe, and Kubernetes - Part 8: Production Deployment & Monitoring
Estimated reading time: 35 minutes
Table of Contents
- Introduction: The Final Mile
- Pre-Deployment Production Readiness Checklist
- Blue-Green Deployment Strategy
- Production Observability Stack
- Incident Response Runbook
- Cost Optimization in Production
- Performance Monitoring & SLO Tracking
- Disaster Recovery & Business Continuity
- Lessons Learned from Production Outages
- Beyond This Series: Advanced Topics
Introduction: The Final Mile
We've built an e-commerce platform over seven parts—from domain-driven design through payment processing, event-driven architecture, and Kubernetes orchestration. Now comes the most critical phase: production deployment and operational excellence.
The reality: most production failures surface after the initial deployment, not from bugs in the code but from operational blind spots. I've participated in three major e-commerce platform launches, and each taught expensive lessons: a database connection pool exhausted at 3 AM during Black Friday, a memory leak that only manifested after 72 hours of uptime, a backup system that "worked" in testing but failed during an actual disaster.
This final part covers what separates projects that survive first contact with production traffic from those that don't. We'll implement real monitoring, establish incident response procedures, and build the operational foundation your platform needs to scale from launch day through the first million orders.
What we're deploying:
┌─────────────────────────────────────────────────────────────┐
│ Production Environment │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Blue │ │ Grafana │ │ Loki │ │
│ │ Environment │◄───│ Monitoring │◄──│ Logs │ │
│ │ (Current) │ │ Stack │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ │ ▼ │ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Green │ │ Prometheus │ │ Jaeger │ │
│ │ Environment │ │ Metrics │ │ Tracing │ │
│ │ (Staged) │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Automated Backup & DR System │ │
│ │ • Database snapshots every 6 hours │ │
│ │ • Cross-region replication │ │
│ │ • Point-in-time recovery capability │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Critical dependencies:
# Monitoring & Observability
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
# Update your local chart index
helm repo update
Pre-Deployment Production Readiness Checklist
Before touching production, verify every system is instrumented and validated. This checklist prevented three near-disasters in my last deployment.
1. Health Check Endpoints
Why this matters: Kubernetes uses these to determine pod health. Get them wrong, and K8s will kill healthy pods or keep broken ones running.
<?php
// app/Http/Controllers/HealthController.php
namespace App\Http\Controllers;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Redis;
use Illuminate\Support\Facades\Cache;
use Illuminate\Http\JsonResponse;
use Illuminate\Http\Response;
class HealthController extends Controller
{
/**
* Liveness probe - "Is the application running?"
* Should return 200 even if dependencies are down.
* Used by Kubernetes to decide if container needs restart.
*
* Failure threshold: 3 consecutive failures = pod restart
*/
public function liveness(): JsonResponse
{
// Only check if PHP-FPM is responsive
// Do NOT check database, Redis, or external dependencies
return response()->json([
'status' => 'alive',
'timestamp' => now()->toIso8601String(),
'uptime_seconds' => (int) shell_exec('cut -d. -f1 /proc/uptime'),
'memory_usage_mb' => round(memory_get_usage(true) / 1024 / 1024, 2),
]);
}
/**
* Readiness probe - "Can the application handle traffic?"
* Should verify all critical dependencies.
* Used by Kubernetes to route traffic to this pod.
*
* Failure threshold: 3 consecutive failures = removed from load balancer
*/
public function readiness(): JsonResponse
{
$checks = [];
$overall_healthy = true;
$start_time = microtime(true);
// Database connectivity check
try {
DB::connection()->getPdo();
$checks['database'] = [
'healthy' => true,
'latency_ms' => round((microtime(true) - $start_time) * 1000, 2),
];
} catch (\Exception $e) {
$checks['database'] = [
'healthy' => false,
'error' => $e->getMessage(),
];
$overall_healthy = false;
}
// Redis connectivity check
$redis_start = microtime(true);
try {
Redis::ping();
$checks['redis'] = [
'healthy' => true,
'latency_ms' => round((microtime(true) - $redis_start) * 1000, 2),
];
} catch (\Exception $e) {
$checks['redis'] = [
'healthy' => false,
'error' => $e->getMessage(),
];
$overall_healthy = false;
}
// Storage connectivity check (S3 or equivalent)
$storage_start = microtime(true);
try {
\Storage::disk('s3')->exists('.health-check');
$checks['storage'] = [
'healthy' => true,
'latency_ms' => round((microtime(true) - $storage_start) * 1000, 2),
];
} catch (\Exception $e) {
$checks['storage'] = [
'healthy' => false,
'error' => $e->getMessage(),
];
// Storage issues shouldn't prevent traffic routing
// Just log for monitoring
\Log::warning('Storage health check failed', [
'error' => $e->getMessage(),
]);
}
// Queue connectivity check
$queue_start = microtime(true);
try {
// Attempt to get queue size without processing jobs
$queue_size = Redis::llen('queues:default');
$checks['queue'] = [
'healthy' => true,
'pending_jobs' => $queue_size,
'latency_ms' => round((microtime(true) - $queue_start) * 1000, 2),
];
} catch (\Exception $e) {
$checks['queue'] = [
'healthy' => false,
'error' => $e->getMessage(),
];
$overall_healthy = false;
}
$status_code = $overall_healthy ? Response::HTTP_OK : Response::HTTP_SERVICE_UNAVAILABLE;
return response()->json([
'status' => $overall_healthy ? 'ready' : 'not_ready',
'checks' => $checks,
'timestamp' => now()->toIso8601String(),
'total_check_time_ms' => round((microtime(true) - $start_time) * 1000, 2),
], $status_code);
}
/**
* Startup probe - "Has the application finished initializing?"
* Critical for slow-starting applications (large caches, migrations, etc).
*
* Failure threshold: 30 failures (at 10s intervals) = 5 minutes before restart
*/
public function startup(): JsonResponse
{
// Check if application has completed critical initialization
$initialized = true;
$initialization_checks = [];
// Verify config cache exists (production requirement)
if (!file_exists(base_path('bootstrap/cache/config.php'))) {
$initialized = false;
$initialization_checks['config_cache'] = false;
} else {
$initialization_checks['config_cache'] = true;
}
// Verify route cache exists (production requirement)
if (!file_exists(base_path('bootstrap/cache/routes-v7.php'))) {
$initialized = false;
$initialization_checks['route_cache'] = false;
} else {
$initialization_checks['route_cache'] = true;
}
        // Verify the database is reachable and at least one migration has run
        // (counting the migrations table does not detect *pending* migrations)
        try {
            $applied_migrations = DB::table('migrations')->count();
            $initialization_checks['migrations'] = $applied_migrations > 0;
            if ($applied_migrations === 0) {
                $initialized = false;
            }
        } catch (\Exception $e) {
            $initialized = false;
            $initialization_checks['migrations'] = false;
        }
$status_code = $initialized ? Response::HTTP_OK : Response::HTTP_SERVICE_UNAVAILABLE;
return response()->json([
'status' => $initialized ? 'initialized' : 'initializing',
'checks' => $initialization_checks,
'timestamp' => now()->toIso8601String(),
], $status_code);
}
/**
* Detailed metrics endpoint for Prometheus scraping
* This endpoint should NOT be exposed publicly
*/
public function metrics(): Response
{
$metrics = [];
// Application metrics
$metrics[] = sprintf('# HELP app_uptime_seconds Application uptime in seconds');
$metrics[] = sprintf('# TYPE app_uptime_seconds gauge');
$metrics[] = sprintf('app_uptime_seconds %d', (int) shell_exec('cut -d. -f1 /proc/uptime'));
// Memory metrics
$metrics[] = sprintf('# HELP app_memory_usage_bytes Current memory usage in bytes');
$metrics[] = sprintf('# TYPE app_memory_usage_bytes gauge');
$metrics[] = sprintf('app_memory_usage_bytes %d', memory_get_usage(true));
// Database connection pool metrics
$metrics[] = sprintf('# HELP db_connections_active Active database connections');
$metrics[] = sprintf('# TYPE db_connections_active gauge');
try {
$connections = DB::select("SHOW STATUS LIKE 'Threads_connected'");
$metrics[] = sprintf('db_connections_active %d', $connections[0]->Value ?? 0);
} catch (\Exception $e) {
$metrics[] = sprintf('db_connections_active 0');
}
// Queue metrics
$metrics[] = sprintf('# HELP queue_jobs_pending Jobs pending in queue');
$metrics[] = sprintf('# TYPE queue_jobs_pending gauge');
try {
$pending = Redis::llen('queues:default');
$metrics[] = sprintf('queue_jobs_pending %d', $pending);
} catch (\Exception $e) {
$metrics[] = sprintf('queue_jobs_pending 0');
}
// Cache hit rate (requires custom tracking)
$metrics[] = sprintf('# HELP cache_hits_total Total cache hits');
$metrics[] = sprintf('# TYPE cache_hits_total counter');
$cache_hits = Cache::get('metrics:cache:hits', 0);
$metrics[] = sprintf('cache_hits_total %d', $cache_hits);
return response(implode("\n", $metrics), 200)
->header('Content-Type', 'text/plain; version=0.0.4');
}
}
Register these routes:
<?php
// routes/web.php
// Health check endpoints - NO authentication, NO rate limiting
// These must respond quickly and reliably
Route::get('/health/live', [App\Http\Controllers\HealthController::class, 'liveness'])
->name('health.liveness');
Route::get('/health/ready', [App\Http\Controllers\HealthController::class, 'readiness'])
->name('health.readiness');
Route::get('/health/startup', [App\Http\Controllers\HealthController::class, 'startup'])
->name('health.startup');
// Metrics endpoint - must only be reachable from inside the cluster
// (throttling alone does not make it private - see the restriction sketch after these routes)
Route::get('/metrics', [App\Http\Controllers\HealthController::class, 'metrics'])
->middleware(['throttle:120,1']) // Allow high frequency scraping
->name('metrics');
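Throttling does not keep the endpoint private on its own. One way to honour the "internal only" rule at the application layer is an IP-allowlist middleware; the sketch below is an assumption-laden example (the class name and CIDR ranges are illustrative, so adjust them to your cluster's pod and service CIDRs), and blocking the /metrics path at the ingress controller is an equally valid alternative:
<?php
// app/Http/Middleware/RestrictToInternalNetwork.php (illustrative name, not part of the series so far)
namespace App\Http\Middleware;
use Closure;
use Illuminate\Http\Request;
use Symfony\Component\HttpFoundation\IpUtils;
use Symfony\Component\HttpFoundation\Response;
class RestrictToInternalNetwork
{
    // Private ranges assumed here; replace with your actual cluster CIDRs
    private array $allowedCidrs = [
        '10.0.0.0/8',
        '172.16.0.0/12',
        '192.168.0.0/16',
        '127.0.0.1/32',
    ];
    public function handle(Request $request, Closure $next): Response
    {
        // Reject requests whose source IP is outside the internal ranges
        if (! IpUtils::checkIp($request->ip(), $this->allowedCidrs)) {
            abort(403, 'Metrics endpoint is internal-only');
        }
        return $next($request);
    }
}
Attach it to the /metrics route alongside the throttle middleware, e.g. ->middleware(['throttle:120,1', \App\Http\Middleware\RestrictToInternalNetwork::class]).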
Configure Kubernetes probes:
# kubernetes/deployments/laravel-web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: laravel-web
namespace: ecommerce
spec:
replicas: 3
selector:
matchLabels:
app: laravel-web
template:
metadata:
labels:
app: laravel-web
version: v1.0.0
spec:
containers:
- name: laravel
image: ghcr.io/ibekzod/laravel-ecommerce:latest
ports:
- containerPort: 8080
name: http
protocol: TCP
# Startup probe - gives app up to 5 minutes to initialize
# Critical for first deployment or after cache clearing
startupProbe:
httpGet:
path: /health/startup
port: 8080
httpHeaders:
- name: X-Probe-Type
value: startup
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 30 # 30 failures * 10s = 5 minutes max startup time
# Liveness probe - restarts container if unhealthy
# Should almost never fail in production
livenessProbe:
httpGet:
path: /health/live
port: 8080
httpHeaders:
- name: X-Probe-Type
value: liveness
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3 # 3 failures = restart
# Readiness probe - removes from load balancer if unhealthy
# Will fail during deployment or dependency issues
readinessProbe:
httpGet:
path: /health/ready
port: 8080
httpHeaders:
- name: X-Probe-Type
value: readiness
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 3 # 3 failures = remove from service
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
env:
- name: APP_ENV
value: "production"
- name: APP_DEBUG
value: "false"
- name: LOG_CHANNEL
value: "stack"
- name: DB_CONNECTION
value: "mysql"
- name: CACHE_DRIVER
value: "redis"
- name: QUEUE_CONNECTION
value: "redis"
# Database credentials from secret
- name: DB_HOST
valueFrom:
secretKeyRef:
name: laravel-secrets
key: db-host
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: laravel-secrets
key: db-password
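The manifest above pulls DB_HOST and DB_PASSWORD from a Secret named laravel-secrets, which the validation script in the next section also checks. A minimal sketch of creating it is shown below; the values are placeholders (the db-host assumes an in-cluster MySQL Service, and the app-key command assumes you run it from the project root), and in a real pipeline the values should come from a secrets manager or sealed-secrets rather than shell history:
# Create the laravel-secrets Secret referenced by the Deployment (placeholder values)
kubectl create secret generic laravel-secrets \
  --namespace ecommerce \
  --from-literal=app-key="$(php artisan key:generate --show)" \
  --from-literal=db-host="mysql.ecommerce.svc.cluster.local" \
  --from-literal=db-user="laravel" \
  --from-literal=db-password="REPLACE_WITH_A_STRONG_PASSWORD"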
2. Production Configuration Validation Script
Why this matters: deploying with the wrong configuration is one of the most common causes of production incidents. This script catches misconfigurations before deployment.
#!/bin/bash
# scripts/validate-production-config.sh
#
# Run this script before EVERY production deployment
# It validates configuration, secrets, and infrastructure readiness
#
# Usage: ./scripts/validate-production-config.sh
set -e # Exit on any error
echo "=================================================="
echo "Production Configuration Validation"
echo "=================================================="
echo ""
# Color codes for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
VALIDATION_FAILED=0
# Function to print validation results
check_passed() {
echo -e "${GREEN}✓${NC} $1"
}
check_failed() {
echo -e "${RED}✗${NC} $1"
VALIDATION_FAILED=1
}
check_warning() {
echo -e "${YELLOW}⚠${NC} $1"
}
# 1. Verify kubectl is configured for correct cluster
echo "1. Verifying Kubernetes cluster context..."
CURRENT_CONTEXT=$(kubectl config current-context)
EXPECTED_CONTEXT="production-cluster"
if [[ "$CURRENT_CONTEXT" == "$EXPECTED_CONTEXT" ]]; then
check_passed "Connected to production cluster: $CURRENT_CONTEXT"
else
check_failed "Wrong cluster context. Expected: $EXPECTED_CONTEXT, Got: $CURRENT_CONTEXT"
echo " Switch context with: kubectl config use-context $EXPECTED_CONTEXT"
fi
# 2. Verify namespace exists
echo ""
echo "2. Verifying namespace..."
if kubectl get namespace ecommerce &> /dev/null; then
check_passed "Namespace 'ecommerce' exists"
else
check_failed "Namespace 'ecommerce' does not exist"
echo " Create with: kubectl create namespace ecommerce"
fi
# 3. Verify all required secrets exist
echo ""
echo "3. Verifying Kubernetes secrets..."
REQUIRED_SECRETS=(
"laravel-secrets"
"stripe-api-keys"
"database-credentials"
"redis-password"
"s3-credentials"
)
for secret in "${REQUIRED_SECRETS[@]}"; do
if kubectl get secret "$secret" -n ecommerce &> /dev/null; then
check_passed "Secret '$secret' exists"
# Verify secret has required keys
case $secret in
"laravel-secrets")
REQUIRED_KEYS=("app-key" "db-host" "db-password")
;;
"stripe-api-keys")
REQUIRED_KEYS=("secret-key" "webhook-secret")
;;
*)
REQUIRED_KEYS=()
;;
esac
for key in "${REQUIRED_KEYS[@]}"; do
if kubectl get secret "$secret" -n ecommerce -o jsonpath="{.data.$key}" &> /dev/null; then
check_passed " Key '$key' present in secret '$secret'"
else
check_failed " Key '$key' missing from secret '$secret'"
fi
done
else
check_failed "Secret '$secret' not found"
fi
done
# 4. Verify ConfigMaps
echo ""
echo "4. Verifying ConfigMaps..."
REQUIRED_CONFIGMAPS=(
"laravel-config"
"nginx-config"
)
for cm in "${REQUIRED_CONFIGMAPS[@]}"; do
if kubectl get configmap "$cm" -n ecommerce &> /dev/null; then
check_passed "ConfigMap '$cm' exists"
else
check_failed "ConfigMap '$cm' not found"
fi
done
# 5. Verify database connectivity
echo ""
echo "5. Verifying database connectivity..."
DB_HOST=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-host}" | base64 -d)
DB_USER=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-user}" | base64 -d)
DB_PASS=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-password}" | base64 -d)
# Test database connection using a temporary pod
cat <<EOF | kubectl apply -f - &> /dev/null
apiVersion: v1
kind: Pod
metadata:
name: db-test
namespace: ecommerce
spec:
containers:
- name: mysql-client
image: mysql:8.0
command: ['sleep', '30']
restartPolicy: Never
EOF
kubectl wait --for=condition=Ready pod/db-test -n ecommerce --timeout=120s # Wait for pod (image pull can take a while)
if kubectl exec -n ecommerce db-test -- mysql -h"$DB_HOST" -u"$DB_USER" -p"$DB_PASS" -e "SELECT 1" &> /dev/null; then
check_passed "Database connection successful"
else
check_failed "Cannot connect to database at $DB_HOST"
fi
kubectl delete pod db-test -n ecommerce &> /dev/null
# 6. Verify Redis connectivity
echo ""
echo "6. Verifying Redis connectivity..."
REDIS_HOST=$(kubectl get configmap laravel-config -n ecommerce -o jsonpath="{.data.REDIS_HOST}")
REDIS_PASSWORD=$(kubectl get secret redis-password -n ecommerce -o jsonpath="{.data.password}" | base64 -d)
cat <<EOF | kubectl apply -f - &> /dev/null
apiVersion: v1
kind: Pod
metadata:
name: redis-test
namespace: ecommerce
spec:
containers:
- name: redis-client
image: redis:7-alpine
command: ['sleep', '30']
restartPolicy: Never
EOF
kubectl wait --for=condition=Ready pod/redis-test -n ecommerce --timeout=120s
if kubectl exec -n ecommerce redis-test -- redis-cli -h "$REDIS_HOST" -a "$REDIS_PASSWORD" PING | grep -q "PONG"; then
check_passed "Redis connection successful"
else
check_failed "Cannot connect to Redis at $REDIS_HOST"
fi
kubectl delete pod redis-test -n ecommerce &> /dev/null
# 7. Verify persistent volumes
echo ""
echo "7. Verifying persistent volumes..."
REQUIRED_PVCS=(
"mysql-pvc"
"redis-pvc"
)
for pvc in "${REQUIRED_PVCS[@]}"; do
PVC_STATUS=$(kubectl get pvc "$pvc" -n ecommerce -o jsonpath="{.status.phase}" 2>/dev/null)
if [[ "$PVC_STATUS" == "Bound" ]]; then
check_passed "PVC '$pvc' is Bound"
else
check_failed "PVC '$pvc' is not Bound (status: $PVC_STATUS)"
fi
done
# 8. Verify container images exist and are accessible
echo ""
echo "8. Verifying container images..."
REQUIRED_IMAGES=(
"ghcr.io/ibekzod/laravel-ecommerce:latest"
"ghcr.io/ibekzod/laravel-ecommerce-worker:latest"
)
for image in "${REQUIRED_IMAGES[@]}"; do
# Try to pull image manifest (without downloading layers)
if docker manifest inspect "$image" &> /dev/null; then
check_passed "Image '$image' is accessible"
else
check_failed "Image '$image' cannot be accessed"
echo " Verify image exists and credentials are correct"
fi
done
# 9. Verify monitoring stack is running
echo ""
echo "9. Verifying monitoring stack..."
MONITORING_SERVICES=(
"prometheus-server"
"grafana"
"loki"
)
for service in "${MONITORING_SERVICES[@]}"; do
if kubectl get deployment "$service" -n monitoring &> /dev/null; then
READY=$(kubectl get deployment "$service" -n monitoring -o jsonpath="{.status.readyReplicas}")
DESIRED=$(kubectl get deployment "$service" -n monitoring -o jsonpath="{.spec.replicas}")
if [[ "$READY" == "$DESIRED" ]]; then
check_passed "$service is running ($READY/$DESIRED replicas ready)"
else
check_warning "$service has $READY/$DESIRED replicas ready"
fi
else
check_failed "$service deployment not found"
fi
done
# 10. Verify backup system is configured
echo ""
echo "10. Verifying backup configuration..."
if kubectl get cronjob database-backup -n ecommerce &> /dev/null; then
LAST_SCHEDULE=$(kubectl get cronjob database-backup -n ecommerce -o jsonpath="{.status.lastScheduleTime}")
check_passed "Database backup CronJob exists (last run: $LAST_SCHEDULE)"
else
check_failed "Database backup CronJob not found"
fi
# 11. Verify SSL/TLS certificates
echo ""
echo "11. Verifying SSL/TLS certificates..."
if kubectl get certificate ecommerce-tls -n ecommerce &> /dev/null; then
CERT_READY=$(kubectl get certificate ecommerce-tls -n ecommerce -o jsonpath="{.status.conditions[?(@.type=='Ready')].status}")
if [[ "$CERT_READY" == "True" ]]; then
check_passed "TLS certificate is ready"
# Check expiry
CERT_EXPIRY=$(kubectl get secret ecommerce-tls -n ecommerce -o jsonpath="{.data.tls\.crt}" | base64 -d | openssl x509 -noout -enddate | cut -d= -f2)
check_passed " Certificate expires: $CERT_EXPIRY"
else
check_failed "TLS certificate not ready"
fi
else
check_failed "TLS certificate not found"
fi
# 12. Verify ingress configuration
echo ""
echo "12. Verifying ingress configuration..."
if kubectl get ingress laravel-ingress -n ecommerce &> /dev/null; then
INGRESS_HOST=$(kubectl get ingress laravel-ingress -n ecommerce -o jsonpath="{.spec.rules[0].host}")
check_passed "Ingress configured for host: $INGRESS_HOST"
# Verify DNS resolution
if host "$INGRESS_HOST" &> /dev/null; then
INGRESS_IP=$(host "$INGRESS_HOST" | awk '/has address/ { print $4 }' | head -1)
check_passed " DNS resolves to: $INGRESS_IP"
else
check_warning " DNS does not resolve for $INGRESS_HOST"
fi
else
check_failed "Ingress not found"
fi
# 13. Verify resource quotas are not exceeded
echo ""
echo "13. Verifying resource quotas..."
if kubectl get resourcequota -n ecommerce &> /dev/null; then
QUOTA_OUTPUT=$(kubectl get resourcequota -n ecommerce -o json)
check_passed "Resource quotas configured"
# Parse and display quota usage
USED_CPU=$(echo "$QUOTA_OUTPUT" | jq -r '.items[0].status.used."requests.cpu" // "0"')
HARD_CPU=$(echo "$QUOTA_OUTPUT" | jq -r '.items[0].status.hard."requests.cpu" // "unlimited"')
echo " CPU: $USED_CPU / $HARD_CPU"
USED_MEM=$(echo "$QUOTA_OUTPUT" | jq -r '.items[0].status.used."requests.memory" // "0"')
HARD_MEM=$(echo "$QUOTA_OUTPUT" | jq -r '.items[0].status.hard."requests.memory" // "unlimited"')
echo " Memory: $USED_MEM / $HARD_MEM"
else
check_warning "No resource quotas configured"
fi
# Final summary
echo ""
echo "=================================================="
if [ $VALIDATION_FAILED -eq 0 ]; then
echo -e "${GREEN}✓ All validation checks passed!${NC}"
echo " Production deployment can proceed."
echo "=================================================="
exit 0
else
echo -e "${RED}✗ Validation failed!${NC}"
echo " Fix the issues above before deploying to production."
echo "=================================================="
exit 1
fi
Make it executable and run before every deployment:
chmod +x scripts/validate-production-config.sh
# Run validation
./scripts/validate-production-config.sh
# Example output when issues found:
# ==================================================
# Production Configuration Validation
# ==================================================
#
# 1. Verifying Kubernetes cluster context...
# ✓ Connected to production cluster: production-cluster
#
# 2. Verifying namespace...
# ✓ Namespace 'ecommerce' exists
#
# 3. Verifying Kubernetes secrets...
# ✓ Secret 'laravel-secrets' exists
# ✓ Key 'app-key' present in secret 'laravel-secrets'
# ✗ Key 'db-password' missing from secret 'laravel-secrets'
# ✗ Secret 'stripe-api-keys' not found
#
# ==================================================
# ✗ Validation failed!
# Fix the issues above before deploying to production.
# ==================================================
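Check #10 above expects a CronJob named database-backup, and the architecture diagram calls for snapshots every six hours. The manifest below is a minimal sketch of that job, not the definitive implementation: it dumps MySQL to a PVC named backup-pvc (an assumption) and reuses the laravel-secrets keys from the web Deployment; shipping dumps to object storage and the cross-region replication from the diagram are left to your storage layer.
# kubernetes/backups/database-backup.yaml (sketch)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
  namespace: ecommerce
spec:
  schedule: "0 */6 * * *"            # every 6 hours, matching the architecture diagram
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: mysqldump
            image: mysql:8.0
            env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: laravel-secrets
                  key: db-host
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: laravel-secrets
                  key: db-user
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: laravel-secrets
                  key: db-password
            command:
            - /bin/sh
            - -c
            - |
              mysqldump -h"$DB_HOST" -u"$DB_USER" -p"$DB_PASSWORD" \
                --single-transaction --routines --all-databases \
                > /backup/dump-$(date +%Y%m%d-%H%M%S).sql
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: backup-pvc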
Blue-Green Deployment Strategy
Blue-green deployment enables zero-downtime releases by maintaining two identical production environments. Traffic switches instantly between them.
Why this matters: rolling updates still expose users to a bad release mid-rollout, and a significant share of traffic can hit broken pods before you notice. Blue-green deployment lets you validate the new version in full before it receives any traffic, and roll back instantly if something slips through.
Complete Blue-Green Implementation
# kubernetes/deployments/blue-green-deploy.yaml
apiVersion: v1
kind: Namespace
metadata:
name: ecommerce
---
# Blue deployment (currently serving production traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: laravel-blue
namespace: ecommerce
labels:
app: laravel
environment: blue
version: v1.2.3
spec:
replicas: 3
selector:
matchLabels:
app: laravel
environment: blue
template:
metadata:
labels:
app: laravel
environment: blue
version: v1.2.3
spec:
containers:
- name: laravel
image: ghcr.io/ibekzod/laravel-ecommerce:v1.2.3
ports:
- containerPort: 8080
envFrom:
- configMapRef:
name: laravel-config
- secretRef:
name: laravel-secrets
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# Green deployment (new version, not yet serving traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: laravel-green
namespace: ecommerce
labels:
app: laravel
environment: green
version: v1.3.0
spec:
replicas: 3
selector:
matchLabels:
app: laravel
environment: green
template:
metadata:
labels:
app: laravel
environment: green
version: v1.3.0
spec:
containers:
- name: laravel
image: ghcr.io/ibekzod/laravel-ecommerce:v1.3.0 # New version
ports:
- containerPort: 8080
envFrom:
- configMapRef:
name: laravel-config
- secretRef:
name: laravel-secrets
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# Service - routes to whichever environment is "active"
apiVersion: v1
kind: Service
metadata:
name: laravel-service
namespace: ecommerce
spec:
selector:
app: laravel
environment: blue # Change this to "green" to switch traffic
ports:
- protocol: TCP
port: 80
targetPort: 8080
type: ClusterIP
---
# Blue service - for direct access during testing
apiVersion: v1
kind: Service
metadata:
name: laravel-blue-service
namespace: ecommerce
spec:
selector:
app: laravel
environment: blue
ports:
- protocol: TCP
port: 80
targetPort: 8080
type: ClusterIP
---
# Green service - for direct access during testing
apiVersion: v1
kind: Service
metadata:
name: laravel-green-service
namespace: ecommerce
spec:
selector:
app: laravel
environment: green
ports:
- protocol: TCP
port: 80
targetPort: 8080
type: ClusterIP
Automated blue-green deployment script:
#!/bin/bash
# scripts/blue-green-deploy.sh
#
# Automated blue-green deployment with smoke tests and rollback capability
#
# Usage: ./scripts/blue-green-deploy.sh <new-version> <environment-to-deploy>
# Example: ./scripts/blue-green-deploy.sh v1.3.0 green
set -e
NEW_VERSION=$1
TARGET_ENV=$2 # "blue" or "green"
if [[ -z "$NEW_VERSION" ]] || [[ -z "$TARGET_ENV" ]]; then
echo "Usage: $0 <version> <environment>"
echo "Example: $0 v1.3.0 green"
exit 1
fi
# Determine which environment is currently active
CURRENT_ENV=$(kubectl get service laravel-service -n ecommerce -o jsonpath='{.spec.selector.environment}')
echo "Current active environment: $CURRENT_ENV"
echo "Deploying version $NEW_VERSION to $TARGET_ENV environment"
if [[ "$CURRENT_ENV" == "$TARGET_ENV" ]]; then
echo "ERROR: Cannot deploy to currently active environment"
echo "Deploy to the inactive environment first, then switch traffic"
exit 1
fi
# Step 1: Deploy new version to target environment
echo ""
echo "Step 1: Deploying $NEW_VERSION to $TARGET_ENV..."
kubectl set image deployment/laravel-$TARGET_ENV \
laravel=ghcr.io/ibekzod/laravel-ecommerce:$NEW_VERSION \
-n ecommerce
# Wait for rollout to complete
echo "Waiting for deployment to complete..."
kubectl rollout status deployment/laravel-$TARGET_ENV -n ecommerce --timeout=5m
# Step 2: Run database migrations on new version (if any)
# NOTE: migrations run against the shared database, so they must remain
# backward-compatible with the version still serving traffic in the other environment
echo ""
echo "Step 2: Running database migrations..."
MIGRATION_POD=$(kubectl get pod -n ecommerce -l app=laravel,environment=$TARGET_ENV -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ecommerce $MIGRATION_POD -- php artisan migrate --force
# Step 3: Wait for all pods to be ready
echo ""
echo "Step 3: Waiting for all pods to be ready..."
kubectl wait --for=condition=ready pod \
-l app=laravel,environment=$TARGET_ENV \
-n ecommerce \
--timeout=5m
# Step 4: Run smoke tests against new environment
echo ""
echo "Step 4: Running smoke tests against $TARGET_ENV environment..."
# Get the service endpoint for the target environment
TARGET_SERVICE="laravel-${TARGET_ENV}-service"
TARGET_URL="http://${TARGET_SERVICE}.ecommerce.svc.cluster.local"
# Create a test pod to run smoke tests from inside the cluster
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: smoke-test
namespace: ecommerce
spec:
containers:
- name: curl
image: curlimages/curl:latest
command: ['sleep', '300']
restartPolicy: Never
EOF
kubectl wait --for=condition=Ready pod/smoke-test -n ecommerce --timeout=120s # Wait for the test pod
# Test 1: Health check
echo " Testing health endpoint..."
kubectl exec -n ecommerce smoke-test -- curl -f -s "$TARGET_URL/health/ready" > /dev/null
echo " ✓ Health check passed"
# Test 2: Homepage loads
echo " Testing homepage..."
kubectl exec -n ecommerce smoke-test -- curl -f -s "$TARGET_URL/" > /dev/null
echo " ✓ Homepage loads"
# Test 3: API endpoint
echo " Testing API endpoint..."
kubectl exec -n ecommerce smoke-test -- curl -f -s "$TARGET_URL/api/products?limit=1" > /dev/null
echo " ✓ API responds"
# Test 4: Database connectivity
echo " Testing database connectivity..."
kubectl exec -n ecommerce $MIGRATION_POD -- php artisan tinker --execute="DB::connection()->getPdo();"
echo " ✓ Database connected"
# Clean up test pod
kubectl delete pod smoke-test -n ecommerce
echo ""
echo "✓ All smoke tests passed"
# Step 5: Prompt to switch traffic
echo ""
echo "=================================================="
echo "Deployment to $TARGET_ENV environment complete!"
echo "New version $NEW_VERSION is ready but not receiving traffic."
echo ""
echo "To switch traffic to $TARGET_ENV environment, run:"
echo " kubectl patch service laravel-service -n ecommerce -p '{\"spec\":{\"selector\":{\"environment\":\"$TARGET_ENV\"}}}'"
echo ""
echo "To rollback if issues occur, run:"
echo " kubectl patch service laravel-service -n ecommerce -p '{\"spec\":{\"selector\":{\"environment\":\"$CURRENT_ENV\"}}}'"
echo "=================================================="
echo ""
# Optional: Automated traffic switch (uncomment if you want automatic switch)
# read -p "Switch traffic to $TARGET_ENV now? (y/N): " -n 1 -r
# echo
# if [[ $REPLY =~ ^[Yy]$ ]]; then
# echo "Switching traffic to $TARGET_ENV..."
# kubectl patch service laravel-service -n ecommerce -p "{\"spec\":{\"selector\":{\"environment\":\"$TARGET_ENV\"}}}"
# echo "✓ Traffic switched to $TARGET_ENV"
# echo "Monitor metrics at: https://grafana.yourdomain.com"
# fi
Run the deployment:
# Make script executable
chmod +x scripts/blue-green-deploy.sh
# Deploy new version to green environment
./scripts/blue-green-deploy.sh v1.3.0 green
# Output:
# Current active environment: blue
# Deploying version v1.3.0 to green environment
#
# Step 1: Deploying v1.3.0 to green...
# deployment.apps/laravel-green image updated
# Waiting for deployment to complete...
# deployment "laravel-green" successfully rolled out
#
# Step 2: Running database migrations...
# Nothing to migrate.
#
# Step 3: Waiting for all pods to be ready...
# pod/laravel-green-7d4f8c9b5-2xkwp condition met
# pod/laravel-green-7d4f8c9b5-8hjnm condition met
# pod/laravel-green-7d4f8c9b5-qz9rt condition met
#
# Step 4: Running smoke tests against green environment...
# Testing health endpoint...
# ✓ Health check passed
# Testing homepage...
# ✓ Homepage loads
# Testing API endpoint...
# ✓ API responds
# Testing database connectivity...
# ✓ Database connected
#
# ✓ All smoke tests passed
#
# ==================================================
# Deployment to green environment complete!
# New version v1.3.0 is ready but not receiving traffic.
#
# To switch traffic to green environment, run:
# kubectl patch service laravel-service -n ecommerce -p '{"spec":{"selector":{"environment":"green"}}}'
#
# To rollback if issues occur, run:
# kubectl patch service laravel-service -n ecommerce -p '{"spec":{"selector":{"environment":"blue"}}}'
# ==================================================
# After verifying everything looks good, switch traffic:
kubectl patch service laravel-service -n ecommerce \
-p '{"spec":{"selector":{"environment":"green"}}}'
# If issues occur, instant rollback:
kubectl patch service laravel-service -n ecommerce \
-p '{"spec":{"selector":{"environment":"blue"}}}'
The beauty of blue-green: Traffic switches in < 1 second. No gradual rollout. Either the new version works completely or you're back on the old version instantly.
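A quick way to confirm the switch actually took effect is to read the Service selector back and check which image the selected pods are running:
# Which environment does the Service route to right now?
kubectl get service laravel-service -n ecommerce \
  -o jsonpath='{.spec.selector.environment}{"\n"}'
# Which image (and therefore version) is serving that traffic?
ACTIVE_ENV=$(kubectl get service laravel-service -n ecommerce -o jsonpath='{.spec.selector.environment}')
kubectl get pods -n ecommerce -l app=laravel,environment=$ACTIVE_ENV \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'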
Production Observability Stack
Observability is not monitoring. Monitoring tells you what is broken. Observability tells you why it's broken. We need both.
Complete Observability Stack Deployment
# Install Prometheus + Grafana + Loki
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Create monitoring namespace
kubectl create namespace monitoring
# Install Prometheus (metrics collection)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
--set grafana.enabled=true \
--set grafana.adminPassword='ChangeThisPassword123!' \
--values - <<EOF
prometheus:
prometheusSpec:
serviceMonitorSelectorNilUsesHelmValues: false
podMonitorSelectorNilUsesHelmValues: false
additionalScrapeConfigs:
- job_name: 'laravel-metrics'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- ecommerce
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: \$1:\$2
target_label: __address__
grafana:
persistence:
enabled: true
size: 10Gi
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/default
dashboards:
default:
laravel-application:
url: https://grafana.com/api/dashboards/14504/revisions/1/download
kubernetes-cluster:
url: https://grafana.com/api/dashboards/7249/revisions/1/download
EOF
# Install Loki (log aggregation)
helm install loki grafana/loki-stack \
--namespace monitoring \
--set loki.persistence.enabled=true \
--set loki.persistence.size=50Gi \
--set promtail.enabled=true
# Install Jaeger (distributed tracing)
helm install jaeger jaegertracing/jaeger \
--namespace monitoring \
--set provisionDataStore.cassandra=false \
--set allInOne.enabled=true \
--set storage.type=memory \
--set agent.enabled=false \
--set collector.enabled=false \
--set query.enabled=false
echo "Observability stack installed!"
echo ""
echo "Access Grafana:"
echo " kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80"
echo " Then visit: http://localhost:3000"
echo " Username: admin"
echo " Password: ChangeThisPassword123!"
echo ""
echo "Access Prometheus:"
echo " kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090"
echo " Then visit: http://localhost:9090"
echo ""
echo "Access Jaeger:"
echo " kubectl port-forward -n monitoring svc/jaeger-query 16686:16686"
echo " Then visit: http://localhost:16686"
Application Instrumentation for Observability
Add Prometheus metrics to Laravel:
<?php
// app/Http/Middleware/PrometheusMetrics.php
namespace App\Http\Middleware;
use Closure;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Redis;
use Symfony\Component\HttpFoundation\Response;
class PrometheusMetrics
{
/**
* Collect application metrics for Prometheus scraping
*/
public function handle(Request $request, Closure $next): Response
{
$start_time = microtime(true);
$start_memory = memory_get_usage(true);
// Process request
$response = $next($request);
// Calculate metrics
$duration = microtime(true) - $start_time;
$memory_used = memory_get_usage(true) - $start_memory;
// Increment request counter
$this->incrementMetric('http_requests_total', [
'method' => $request->method(),
'path' => $request->route()?->uri() ?? 'unknown',
'status' => $response->getStatusCode(),
]);
// Record request duration histogram
$this->recordHistogram('http_request_duration_seconds', $duration, [
'method' => $request->method(),
'path' => $request->route()?->uri() ?? 'unknown',
]);
// Record memory usage
$this->recordGauge('http_request_memory_bytes', $memory_used, [
'method' => $request->method(),
]);
        // Track database queries (stays at zero unless the query log is enabled - see note after this section)
        $query_count = count(DB::getQueryLog());
if ($query_count > 0) {
$this->recordHistogram('database_queries_per_request', $query_count, [
'path' => $request->route()?->uri() ?? 'unknown',
]);
}
return $response;
}
/**
* Increment a counter metric
*/
private function incrementMetric(string $name, array $labels = []): void
{
$key = $this->buildMetricKey($name, $labels);
Cache::increment($key);
// Store metric metadata for exposition
$this->storeMetricMetadata($name, 'counter', $labels);
}
/**
* Record a histogram value
* Simplified implementation - production would use proper histogram buckets
*/
private function recordHistogram(string $name, float $value, array $labels = []): void
{
// Store in Redis sorted set for percentile calculations
$key = $this->buildMetricKey($name, $labels);
// Store value with timestamp as score
        Redis::zadd("histogram:$key", time(), $value);
        // Keep only last 1000 values
        Redis::zremrangebyrank("histogram:$key", 0, -1001);
$this->storeMetricMetadata($name, 'histogram', $labels);
}
/**
* Set a gauge value
*/
private function recordGauge(string $name, float $value, array $labels = []): void
{
$key = $this->buildMetricKey($name, $labels);
Cache::put($key, $value, now()->addMinutes(5));
$this->storeMetricMetadata($name, 'gauge', $labels);
}
/**
* Build cache key from metric name and labels
*/
private function buildMetricKey(string $name, array $labels): string
{
$label_string = '';
foreach ($labels as $key => $value) {
$label_string .= "{$key}=\"{$value}\",";
}
return "metrics:{$name}{" . rtrim($label_string, ',') . "}";
}
/**
* Store metric metadata for exposition format
*/
private function storeMetricMetadata(string $name, string $type, array $labels): void
{
$metadata = [
'type' => $type,
'labels' => array_keys($labels),
];
Cache::put("metrics:metadata:$name", $metadata, now()->addHours(24));
}
}
Register middleware:
<?php
// bootstrap/app.php - Laravel 11+ no longer ships app/Http/Kernel.php;
// global middleware is registered here instead
use Illuminate\Foundation\Application;
use Illuminate\Foundation\Configuration\Middleware;
return Application::configure(basePath: dirname(__DIR__))
    // ... existing withRouting / withExceptions configuration
    ->withMiddleware(function (Middleware $middleware) {
        $middleware->append(\App\Http\Middleware\PrometheusMetrics::class);
    })
    ->create();
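One caveat: DB::getQueryLog() stays empty unless the query log is switched on, so the database_queries_per_request histogram above will always read zero out of the box. A minimal sketch of enabling it behind a config flag (the metrics.track_queries key is an assumption, not a Laravel default):
<?php
// app/Providers/AppServiceProvider.php (excerpt, sketch)
namespace App\Providers;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\ServiceProvider;
class AppServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        // The PrometheusMetrics middleware counts DB::getQueryLog() entries,
        // which are only recorded when the query log is enabled.
        // Enabling it adds per-request memory overhead, so keep it behind a flag.
        if (config('metrics.track_queries', false)) {
            DB::enableQueryLog();
        }
    }
}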
Create custom Grafana dashboard configuration:
{
"dashboard": {
"title": "Laravel E-Commerce Platform - Production Metrics",
"tags": ["laravel", "ecommerce", "production"],
"timezone": "browser",
"panels": [
{
"title": "Request Rate (req/s)",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total{namespace=\"ecommerce\"}[5m])",
"legendFormat": "{{method}} {{path}}"
}
],
"yaxes": [
{
"label": "Requests/sec"
}
]
},
{
"title": "Response Time (p95)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{namespace=\"ecommerce\"}[5m]))",
"legendFormat": "{{method}} {{path}}"
}
],
"yaxes": [
{
"label": "Seconds",
"format": "s"
}
]
},
{
"title": "Error Rate (%)",
"type": "graph",
"targets": [
{
"expr": "100 * (rate(http_requests_total{namespace=\"ecommerce\",status=~\"5..\"}[5m]) / rate(http_requests_total{namespace=\"ecommerce\"}[5m]))",
"legendFormat": "5xx Errors"
}
],
"yaxes": [
{
"label": "Percentage",
"format": "percent"
}
],
"alert": {
"conditions": [
{
"evaluator": {
"params": [1],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"params": [],
"type": "avg"
},
"type": "query"
}
],
"executionErrorState": "alerting",
"for": "5m",
"frequency": "1m",
"handler": 1,
"name": "High Error Rate Alert",
"noDataState": "no_data",
"notifications": []
}
},
{
"title": "Database Query Count per Request",
"type": "graph",
"targets": [
{
"expr": "rate(database_queries_per_request_sum{namespace=\"ecommerce\"}[5m]) / rate(database_queries_per_request_count{namespace=\"ecommerce\"}[5m])",
"legendFormat": "{{path}}"
}
],
"yaxes": [
{
"label": "Queries"
}
]
},
{
"title": "Active Pods",
"type": "stat",
"targets": [
{
"expr": "count(kube_pod_status_phase{namespace=\"ecommerce\",phase=\"Running\"})",
"legendFormat": "Running Pods"
}
]
},
{
"title": "Pod CPU Usage",
"type": "graph",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"ecommerce\",pod=~\"laravel.*\"}[5m])) by (pod)",
"legendFormat": "{{pod}}"
}
],
"yaxes": [
{
"label": "CPU Cores"
}
]
},
{
"title": "Pod Memory Usage",
"type": "graph",
"targets": [
{
"expr": "sum(container_memory_usage_bytes{namespace=\"ecommerce\",pod=~\"laravel.*\"}) by (pod)",
"legendFormat": "{{pod}}"
}
],
"yaxes": [
{
"label": "Bytes",
"format": "bytes"
}
]
},
{
"title": "Redis Operations Rate",
"type": "graph",
"targets": [
{
"expr": "rate(redis_commands_processed_total{namespace=\"ecommerce\"}[5m])",
"legendFormat": "Commands/sec"
}
]
},
{
"title": "Queue Job Processing Rate",
"type": "graph",
"targets": [
{
"expr": "rate(queue_jobs_processed_total{namespace=\"ecommerce\"}[5m])",
"legendFormat": "{{queue}}"
}
]
},
{
"title": "Failed Jobs (last hour)",
"type": "stat",
"targets": [
{
"expr": "sum(increase(queue_jobs_failed_total{namespace=\"ecommerce\"}[1h]))"
}
],
"thresholds": [
{
"value": 0,
"color": "green"
},
{
"value": 10,
"color": "yellow"
},
{
"value": 100,
"color": "red"
}
]
}
]
}
}
Save this as monitoring/grafana-dashboard.json and import into Grafana.
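To import it without clicking through the UI, the Grafana HTTP API works as well; the sketch below assumes the port-forward and admin password from the Helm install above (in practice, prefer an API token over basic auth):
# Import the dashboard via the Grafana API (sketch)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
# The import endpoint expects the JSON wrapped under a "dashboard" key,
# which monitoring/grafana-dashboard.json already provides
curl -s -X POST 'http://admin:ChangeThisPassword123!@localhost:3000/api/dashboards/db' \
  -H 'Content-Type: application/json' \
  -d @monitoring/grafana-dashboard.json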
Incident Response Runbook
When production breaks at 2 AM, you need a runbook. Not documentation—a step-by-step recovery procedure.
Incident Classification & Response Times
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| P0 - Critical | Complete service outage | < 15 minutes | Database down, all requests failing |
| P1 - High | Major feature broken | < 1 hour | Payment processing failing |
| P2 - Medium | Degraded performance | < 4 hours | Slow response times, increased errors |
| P3 - Low | Minor issue | < 24 hours | Non-critical feature bug |
P0: Critical Incident Response Procedure
#!/bin/bash
# runbooks/p0-incident-response.sh
#
# Execute this immediately when P0 incident is detected
# This script gathers diagnostic information and prepares rollback
set -e
echo "=========================================="
echo "P0 CRITICAL INCIDENT RESPONSE"
echo "Started at: $(date)"
echo "=========================================="
echo ""
# Create incident directory
INCIDENT_ID="incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "incidents/$INCIDENT_ID"
cd "incidents/$INCIDENT_ID"
echo "Incident ID: $INCIDENT_ID"
echo "Collecting diagnostic information..."
echo ""
# 1. Capture current cluster state
echo "1. Capturing cluster state..."
kubectl get all -n ecommerce > cluster-state.txt
kubectl get events -n ecommerce --sort-by='.lastTimestamp' > events.txt
kubectl top nodes > node-resources.txt
kubectl top pods -n ecommerce > pod-resources.txt
# 2. Check pod status
echo "2. Checking pod health..."
kubectl get pods -n ecommerce -o wide > pods-detailed.txt
UNHEALTHY_PODS=$(kubectl get pods -n ecommerce --field-selector=status.phase!=Running -o name)
if [[ -n "$UNHEALTHY_PODS" ]]; then
echo "ALERT: Unhealthy pods detected:"
echo "$UNHEALTHY_PODS"
# Get logs from unhealthy pods
for pod in $UNHEALTHY_PODS; do
POD_NAME=$(echo $pod | cut -d'/' -f2)
echo " Collecting logs from $POD_NAME..."
kubectl logs $pod -n ecommerce --tail=500 > "logs-${POD_NAME}.txt" 2>&1 || true
kubectl describe $pod -n ecommerce > "describe-${POD_NAME}.txt" 2>&1 || true
done
fi
# 3. Check service endpoints
echo "3. Checking service endpoints..."
kubectl get endpoints -n ecommerce > endpoints.txt
SERVICE_ENDPOINTS=$(kubectl get endpoints laravel-service -n ecommerce -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w)
if [[ $SERVICE_ENDPOINTS -eq 0 ]]; then
echo "CRITICAL: No healthy endpoints for laravel-service!"
echo "Service has zero ready pods - complete outage"
fi
# 4. Check recent deployments
echo "4. Checking recent deployments..."
kubectl rollout history deployment -n ecommerce > deployment-history.txt
# 5. Query Prometheus for error rates
echo "5. Querying error metrics..."
# This requires prometheus to be accessible
PROM_URL="http://prometheus-kube-prometheus-prometheus.monitoring:9090"
ERROR_RATE=$(curl -s -G "$PROM_URL/api/v1/query" --data-urlencode 'query=sum(rate(http_requests_total{namespace="ecommerce",status=~"5.."}[5m]))' | jq -r '.data.result[0].value[1]' 2>/dev/null || echo "N/A")
echo "Current 5xx error rate: $ERROR_RATE errors/sec"
# 6. Check database connectivity
echo "6. Checking database connectivity..."
DB_POD=$(kubectl get pod -n ecommerce -l app=mysql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -n "$DB_POD" ]]; then
kubectl exec -n ecommerce $DB_POD -- mysql -e "SELECT 1" > db-connectivity.txt 2>&1 && \
echo " Database is responding" || \
echo " ERROR: Database is not responding"
# Check database connections
kubectl exec -n ecommerce $DB_POD -- mysql -e "SHOW PROCESSLIST" > db-connections.txt 2>&1
# Check database locks
kubectl exec -n ecommerce $DB_POD -- mysql -e "SHOW ENGINE INNODB STATUS\G" > db-locks.txt 2>&1
else
echo " ERROR: Cannot find database pod"
fi
# 7. Check Redis connectivity
echo "7. Checking Redis connectivity..."
REDIS_POD=$(kubectl get pod -n ecommerce -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -n "$REDIS_POD" ]]; then
kubectl exec -n ecommerce $REDIS_POD -- redis-cli PING > redis-connectivity.txt 2>&1 && \
echo " Redis is responding" || \
echo " ERROR: Redis is not responding"
# Check Redis memory usage
kubectl exec -n ecommerce $REDIS_POD -- redis-cli INFO memory > redis-memory.txt 2>&1
# Check queue sizes
kubectl exec -n ecommerce $REDIS_POD -- redis-cli LLEN queues:default > queue-size.txt 2>&1
else
echo " ERROR: Cannot find Redis pod"
fi
# 8. Generate incident summary
echo ""
echo "=========================================="
echo "DIAGNOSTIC SUMMARY"
echo "=========================================="
cat > incident-summary.txt <<EOF
Incident ID: $INCIDENT_ID
Timestamp: $(date)
Severity: P0 - Critical
CLUSTER STATE:
- Total pods: $(kubectl get pods -n ecommerce --no-headers | wc -l)
- Running pods: $(kubectl get pods -n ecommerce --field-selector=status.phase=Running --no-headers | wc -l)
- Failed pods: $(kubectl get pods -n ecommerce --field-selector=status.phase=Failed --no-headers | wc -l)
- Service endpoints: $SERVICE_ENDPOINTS
RECENT EVENTS:
$(kubectl get events -n ecommerce --sort-by='.lastTimestamp' | tail -10)
ERROR RATE:
- 5xx errors/sec: $ERROR_RATE
RECOMMENDED ACTIONS:
1. Review pod logs in logs-*.txt files
2. Check deployment-history.txt for recent changes
3. Verify database connectivity (db-connectivity.txt)
4. If recent deployment, consider rollback:
kubectl rollout undo deployment/laravel-blue -n ecommerce
kubectl rollout undo deployment/laravel-green -n ecommerce
5. If database issue, check db-locks.txt for deadlocks
6. If memory issue, check pod-resources.txt
ROLLBACK COMMANDS:
# Rollback to previous deployment
kubectl rollout undo deployment/laravel-blue -n ecommerce
kubectl rollout undo deployment/laravel-green -n ecommerce
# Or switch blue-green environment (if using blue-green)
kubectl patch service laravel-service -n ecommerce -p '{"spec":{"selector":{"environment":"blue"}}}'
# Scale up if pods are down
kubectl scale deployment/laravel-blue -n ecommerce --replicas=5
# Restart all pods (last resort)
kubectl rollout restart deployment/laravel-blue -n ecommerce
kubectl rollout restart deployment/laravel-green -n ecommerce
EOF
cat incident-summary.txt
echo ""
echo "=========================================="
echo "Diagnostic information saved to: incidents/$INCIDENT_ID"
echo ""
echo "Next steps:"
echo "1. Review incident-summary.txt"
echo "2. Execute rollback if needed"
echo "3. Notify team via Slack/PagerDuty"
echo "4. Create post-incident report"
echo "=========================================="
Make it executable:
chmod +x runbooks/p0-incident-response.sh
# Run during incident
./runbooks/p0-incident-response.sh
# Output saved to incidents/incident-20240315-143022/
# Review files and execute recommended actions
Automated Alerting Configuration
Configure PagerDuty integration with Prometheus AlertManager:
# monitoring/alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-config
namespace: monitoring
type: Opaque
stringData:
alertmanager.yml: |
global:
resolve_timeout: 5m
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'team-pagerduty'
routes:
# P0 - Immediate page
- match:
severity: critical
receiver: 'team-pagerduty'
continue: true
# P1 - Page during business hours, urgent slack otherwise
- match:
severity: warning
receiver: 'team-slack'
# P2/P3 - Slack only
- match:
severity: info
receiver: 'team-slack'
receivers:
- name: 'team-pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
severity: '{{ .CommonLabels.severity }}'
details:
firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
num_firing: '{{ .Alerts.Firing | len }}'
num_resolved: '{{ .Alerts.Resolved | len }}'
- name: 'team-slack'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK_URL'
channel: '#production-alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true
Define Prometheus alert rules:
# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: laravel-ecommerce-alerts
namespace: monitoring
spec:
groups:
- name: application
interval: 30s
rules:
# P0 - Critical: Complete service outage
- alert: ServiceDown
expr: up{job="laravel-metrics"} == 0
for: 1m
labels:
severity: critical
priority: P0
annotations:
summary: "Laravel service is down"
description: "Laravel application in namespace {{ $labels.namespace }} has been down for more than 1 minute"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-ServiceDown"
# P0 - Critical: High error rate
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{namespace="ecommerce",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{namespace="ecommerce"}[5m]))
) > 0.05
for: 2m
labels:
severity: critical
priority: P0
annotations:
summary: "Error rate above 5%"
description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-HighErrorRate"
# P0 - Critical: Database down
- alert: DatabaseDown
expr: up{job="mysql-exporter"} == 0
for: 1m
labels:
severity: critical
priority: P0
annotations:
summary: "Database is unreachable"
description: "Cannot connect to MySQL database in namespace {{ $labels.namespace }}"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-DatabaseDown"
# P1 - High: Slow response times
- alert: HighResponseTime
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{namespace="ecommerce"}[5m])
) > 2
for: 5m
labels:
severity: warning
priority: P1
annotations:
summary: "95th percentile response time above 2 seconds"
description: "P95 response time is {{ $value | humanizeDuration }} for {{ $labels.path }}"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-SlowResponses"
# P1 - High: Queue backup
- alert: QueueBacklog
expr: queue_jobs_pending > 1000
for: 10m
labels:
severity: warning
priority: P1
annotations:
summary: "Queue has {{ $value }} pending jobs"
description: "Queue {{ $labels.queue }} has been backed up for 10 minutes"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-QueueBacklog"
# P1 - High: Pod crashlooping
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total{namespace="ecommerce"}[15m]) > 0
for: 5m
labels:
severity: warning
priority: P1
annotations:
summary: "Pod {{ $labels.pod }} is crash looping"
description: "Pod has restarted {{ $value }} times in the last 15 minutes"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-CrashLoop"
# P2 - Medium: High memory usage
- alert: HighMemoryUsage
expr: |
(
container_memory_usage_bytes{namespace="ecommerce",pod=~"laravel.*"}
/
container_spec_memory_limit_bytes{namespace="ecommerce",pod=~"laravel.*"}
) > 0.85
for: 10m
labels:
severity: warning
priority: P2
annotations:
summary: "Pod {{ $labels.pod }} memory usage above 85%"
description: "Memory usage is {{ $value | humanizePercentage }}"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-HighMemory"
# P2 - Medium: High CPU usage
- alert: HighCPUUsage
expr: |
(
rate(container_cpu_usage_seconds_total{namespace="ecommerce",pod=~"laravel.*"}[5m])
/
container_spec_cpu_quota{namespace="ecommerce",pod=~"laravel.*"}
) > 0.85
for: 10m
labels:
severity: warning
priority: P2
annotations:
summary: "Pod {{ $labels.pod }} CPU usage above 85%"
description: "CPU usage is {{ $value | humanizePercentage }}"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-HighCPU"
# P2 - Medium: Disk space running low
- alert: DiskSpaceLow
expr: |
(
kubelet_volume_stats_available_bytes{namespace="ecommerce"}
/
kubelet_volume_stats_capacity_bytes{namespace="ecommerce"}
) < 0.15
for: 5m
labels:
severity: warning
priority: P2
annotations:
summary: "Disk space below 15% on {{ $labels.persistentvolumeclaim }}"
description: "Only {{ $value | humanizePercentage }} disk space remaining"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-DiskSpace"
# P3 - Low: Certificate expiring soon
- alert: CertificateExpiringSoon
expr: (certmanager_certificate_expiration_timestamp_seconds - time()) < (7 * 24 * 3600)
for: 1h
labels:
severity: info
priority: P3
annotations:
summary: "Certificate {{ $labels.name }} expires in {{ $value | humanizeDuration }}"
description: "TLS certificate will expire soon, renewal required"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-CertRenewal"
- name: business-metrics
interval: 1m
rules:
# Business metric: Payment processing failure rate
- alert: PaymentFailureSpike
expr: |
(
rate(payment_transactions_total{status="failed"}[5m])
/
rate(payment_transactions_total[5m])
) > 0.10
for: 3m
labels:
severity: critical
priority: P0
team: payments
annotations:
summary: "Payment failure rate above 10%"
description: "{{ $value | humanizePercentage }} of payments are failing"
impact: "Direct revenue loss"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-PaymentFailures"
# Business metric: Order processing stalled
- alert: NoOrdersProcessed
expr: rate(orders_created_total[10m]) == 0
for: 10m
labels:
severity: warning
priority: P1
team: orders
annotations:
summary: "No orders processed in last 10 minutes"
description: "Order creation has stopped - possible system issue"
runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-OrderProcessing"
Apply the configurations:
# Apply alert rules
kubectl apply -f monitoring/prometheus-rules.yaml
# Apply alertmanager config
kubectl apply -f monitoring/alertmanager-config.yaml
# Restart alertmanager to pick up new config
kubectl rollout restart statefulset/alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring
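The business-metric rules above assume the application exports payment_transactions_total and orders_created_total; Kubernetes won't produce those for you. Below is a minimal sketch of emitting them with the same Cache-based counter pattern the PrometheusMetrics middleware uses, called from listeners on your payment and order events (the class and method names are illustrative, not part of the earlier code); the counters then also need to be added to the exposition in HealthController::metrics() so Prometheus can scrape them:
<?php
// app/Listeners/RecordBusinessMetrics.php (sketch; wire it to your own events)
namespace App\Listeners;
use Illuminate\Support\Facades\Cache;
class RecordBusinessMetrics
{
    // Call from a listener on your payment-completed / payment-failed event
    public function recordPayment(bool $succeeded): void
    {
        $status = $succeeded ? 'succeeded' : 'failed';
        Cache::increment("metrics:payment_transactions_total{status=\"{$status}\"}");
    }
    // Call from a listener on your order-created event
    public function recordOrderCreated(): void
    {
        Cache::increment('metrics:orders_created_total');
    }
}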
Cost Optimization in Production
Production costs spiral without active management. Here's how to optimize without sacrificing reliability.
1. Right-Sizing Resources Based on Actual Usage
#!/bin/bash
# scripts/analyze-resource-usage.sh
#
# Analyzes actual resource usage vs requested/limits
# Identifies opportunities for cost reduction
echo "Resource Usage Analysis"
echo "======================="
echo ""
# Get all pods in ecommerce namespace
PODS=$(kubectl get pods -n ecommerce -o jsonpath='{.items[*].metadata.name}')
echo "Pod Resource Usage vs Requests/Limits:"
echo ""
printf "%-40s %10s %10s %10s %10s %15s\n" "POD" "CPU_USE" "CPU_REQ" "MEM_USE" "MEM_REQ" "OPTIMIZATION"
for pod in $PODS; do
# Get actual usage
CPU_USAGE=$(kubectl top pod $pod -n ecommerce --no-headers | awk '{print $2}')
MEM_USAGE=$(kubectl top pod $pod -n ecommerce --no-headers | awk '{print $3}')
# Get requests
CPU_REQUEST=$(kubectl get pod $pod -n ecommerce -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
MEM_REQUEST=$(kubectl get pod $pod -n ecommerce -o jsonpath='{.spec.containers[0].resources.requests.memory}')
# Calculate utilization
CPU_USAGE_NUM=$(echo $CPU_USAGE | sed 's/m//')
CPU_REQUEST_NUM=$(echo $CPU_REQUEST | sed 's/m//')
if [[ $CPU_REQUEST_NUM -gt 0 ]]; then
CPU_UTIL=$((CPU_USAGE_NUM * 100 / CPU_REQUEST_NUM))
else
CPU_UTIL=0
fi
# Provide optimization recommendation
if [[ $CPU_UTIL -lt 30 ]]; then
OPTIMIZATION="REDUCE_REQUESTS"
elif [[ $CPU_UTIL -gt 80 ]]; then
OPTIMIZATION="INCREASE_REQUESTS"
else
OPTIMIZATION="OK"
fi
printf "%-40s %10s %10s %10s %10s %15s\n" \
"$pod" "$CPU_USAGE" "$CPU_REQUEST" "$MEM_USAGE" "$MEM_REQUEST" "$OPTIMIZATION"
done
echo ""
echo "Cost Optimization Recommendations:"
echo "==================================="
# Calculate total cluster costs (example with AWS EKS pricing)
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
NODE_TYPE="t3.large" # Adjust to your instance type
COST_PER_NODE_HOUR=0.0832 # t3.large on-demand price
HOURS_PER_MONTH=730
MONTHLY_COST=$(echo "$NODE_COUNT * $COST_PER_NODE_HOUR * $HOURS_PER_MONTH" | bc)
echo "Current cluster costs:"
echo " Nodes: $NODE_COUNT x $NODE_TYPE"
echo " Estimated monthly cost: \$$MONTHLY_COST"
echo ""
# Recommendations
echo "1. Consider using Spot Instances for non-critical workloads"
echo " Potential savings: 60-90%"
echo ""
echo "2. Enable Cluster Autoscaler to scale nodes based on demand"
echo " Average savings: 30-40%"
echo ""
echo "3. Use Horizontal Pod Autoscaler for application scaling"
echo " Prevents over-provisioning"
echo ""
echo "4. Implement PodDisruptionBudgets for safe scaling"
echo " Maintains availability during scale-down"
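Recommendation 4 deserves its own manifest, since the series hasn't defined one yet. A minimal PodDisruptionBudget sketch — the selector labels are assumptions; match them to your blue Deployment's pod labels:
# kubernetes/autoscaling/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: laravel-pdb
  namespace: ecommerce
spec:
  # Keep at least two application pods running during node drains and scale-down
  minAvailable: 2
  selector:
    matchLabels:
      app: laravel
      environment: blue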
2. Implement Horizontal Pod Autoscaler
# kubernetes/autoscaling/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: laravel-hpa
namespace: ecommerce
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: laravel-blue
minReplicas: 2
maxReplicas: 10
metrics:
# Scale based on CPU usage
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Scale based on memory usage
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# Scale based on custom metric (requests per second)
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 4
periodSeconds: 30
selectPolicy: Max
---
# Worker HPA - different scaling characteristics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: laravel-worker-hpa
namespace: ecommerce
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: laravel-worker
minReplicas: 1
maxReplicas: 5
metrics:
# Scale based on queue length
- type: External
external:
metric:
name: redis_queue_length
selector:
matchLabels:
queue: default
target:
type: AverageValue
averageValue: "100"
behavior:
scaleDown:
stabilizationWindowSeconds: 600 # Wait 10 minutes before scaling down
policies:
- type: Pods
value: 1
periodSeconds: 120
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 2
periodSeconds: 60
Apply and verify HPA:
kubectl apply -f kubernetes/autoscaling/hpa.yaml
# Verify HPA is working
kubectl get hpa -n ecommerce
# Output:
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# laravel-hpa Deployment/laravel-blue 45%/70%, 52%/80% 2 10 3
# laravel-worker-hpa  Deployment/laravel-worker   73/100             1         5         2
# Watch HPA in action
kubectl get hpa -n ecommerce --watch
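One caveat before relying on this: the http_requests_per_second Pods metric in the first HPA only exists if a metrics adapter exposes it through the custom metrics API, and the redis_queue_length external metric likewise needs an external-metrics adapter plus something exporting queue depth to Prometheus. A sketch of a prometheus-adapter rule that could back the requests metric — the series name and labels are assumptions; match them to what your application actually exports:
# prometheus-adapter values excerpt (illustrative)
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^http_requests_total$"
        as: "http_requests_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'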
3. Cluster Autoscaler for Node Management
# kubernetes/autoscaling/cluster-autoscaler.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: cluster-autoscaler
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
resources: ["events", "endpoints"]
verbs: ["create", "patch"]
- apiGroups: [""]
resources: ["pods/eviction"]
verbs: ["create"]
- apiGroups: [""]
resources: ["pods/status"]
verbs: ["update"]
- apiGroups: [""]
resources: ["endpoints"]
resourceNames: ["cluster-autoscaler"]
verbs: ["get", "update"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["watch", "list", "get", "update"]
- apiGroups: [""]
resources:
- "namespaces"
- "pods"
- "services"
- "replicationcontrollers"
- "persistentvolumeclaims"
- "persistentvolumes"
verbs: ["watch", "list", "get"]
- apiGroups: ["extensions"]
resources: ["replicasets", "daemonsets"]
verbs: ["watch", "list", "get"]
- apiGroups: ["policy"]
resources: ["poddisruptionbudgets"]
verbs: ["watch", "list"]
- apiGroups: ["apps"]
resources: ["statefulsets", "replicasets", "daemonsets"]
verbs: ["watch", "list", "get"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
verbs: ["watch", "list", "get"]
- apiGroups: ["batch", "extensions"]
resources: ["jobs"]
verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["create"]
- apiGroups: ["coordination.k8s.io"]
resourceNames: ["cluster-autoscaler"]
resources: ["leases"]
verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["create","list","watch"]
- apiGroups: [""]
resources: ["configmaps"]
resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: cluster-autoscaler
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-autoscaler
subjects:
- kind: ServiceAccount
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: cluster-autoscaler
subjects:
- kind: ServiceAccount
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
app: cluster-autoscaler
spec:
replicas: 1
selector:
matchLabels:
app: cluster-autoscaler
template:
metadata:
labels:
app: cluster-autoscaler
spec:
priorityClassName: system-cluster-critical
securityContext:
runAsNonRoot: true
runAsUser: 65534
fsGroup: 65534
serviceAccountName: cluster-autoscaler
containers:
- image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
name: cluster-autoscaler
resources:
limits:
cpu: 100m
memory: 600Mi
requests:
cpu: 100m
memory: 600Mi
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production-cluster
- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false
- --scale-down-unneeded-time=10m
- --scale-down-delay-after-add=10m
volumeMounts:
- name: ssl-certs
mountPath: /etc/ssl/certs/ca-certificates.crt
readOnly: true
imagePullPolicy: "Always"
volumes:
- name: ssl-certs
hostPath:
path: "/etc/ssl/certs/ca-bundle.crt"
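Apply it and confirm the autoscaler is healthy and discovering node groups. This assumes your Auto Scaling Groups carry the k8s.io/cluster-autoscaler tags referenced in the flags above, and that the Deployment has IAM permissions for the Auto Scaling API (typically via IRSA):
kubectl apply -f kubernetes/autoscaling/cluster-autoscaler.yaml
# Check the deployment and its recent decisions
kubectl -n kube-system rollout status deployment/cluster-autoscaler
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=20
# The status ConfigMap summarizes node group health and scaling activity
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml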
Performance Monitoring & SLO Tracking
Service Level Objectives (SLOs) define acceptable performance. Track them rigorously.
Define SLOs for E-Commerce Platform
# monitoring/slo-definitions.yaml
# Service Level Objectives for Laravel E-Commerce Platform
#
# SLI (Service Level Indicator): What you measure
# SLO (Service Level Objective): Target value
# SLA (Service Level Agreement): What you promise customers
slos:
# Availability SLO: 99.9% uptime (43 minutes downtime per month)
- name: "availability"
target: 0.999
window: "30d"
sli:
query: |
sum(rate(http_requests_total{namespace="ecommerce",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{namespace="ecommerce"}[5m]))
error_budget: 0.001 # 0.1% = 43 minutes per month
# Latency SLO: 95% of requests under 500ms
- name: "latency_p95"
target: 0.5 # seconds
percentile: 95
window: "7d"
sli:
query: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{namespace="ecommerce"}[5m])
)
# Payment Success Rate: 99.5% of payments succeed
- name: "payment_success_rate"
target: 0.995
window: "30d"
sli:
query: |
sum(rate(payment_transactions_total{status="success"}[5m]))
/
sum(rate(payment_transactions_total[5m]))
error_budget: 0.005
# Order Processing Time: 95% of orders processed within 30 seconds
- name: "order_processing_time_p95"
target: 30 # seconds
percentile: 95
window: "7d"
sli:
query: |
histogram_quantile(0.95,
rate(order_processing_duration_seconds_bucket[5m])
)
Implement SLO tracking in Laravel:
<?php
// app/Services/SLOTracker.php
namespace App\Services;
use Illuminate\Support\Facades\Redis;
use Illuminate\Support\Facades\Log;
class SLOTracker
{
/**
* Track availability SLI
*/
public function trackRequest(string $status_code, float $duration): void
{
$timestamp = now()->timestamp;
$success = !str_starts_with($status_code, '5');
// Store in time-series for SLO calculation
Redis::zadd('slo:availability:requests', $timestamp, json_encode([
'timestamp' => $timestamp,
'success' => $success,
'status' => $status_code,
]));
// Store latency measurement
Redis::zadd('slo:latency:measurements', $timestamp, $duration);
// Keep only last 30 days of data
$cutoff = now()->subDays(30)->timestamp;
Redis::zremrangebyscore('slo:availability:requests', '-inf', $cutoff);
Redis::zremrangebyscore('slo:latency:measurements', '-inf', $cutoff);
// Check if we're burning error budget too fast
$this->checkErrorBudgetBurn();
}
/**
* Track payment transaction outcome
*/
public function trackPayment(bool $success, float $amount): void
{
$timestamp = now()->timestamp;
Redis::zadd('slo:payments:transactions', $timestamp, json_encode([
'timestamp' => $timestamp,
'success' => $success,
'amount' => $amount,
]));
// Alert if payment success rate drops below threshold
if (!$success) {
$recent_failure_rate = $this->calculateRecentPaymentFailureRate();
if ($recent_failure_rate > 0.02) { // 2% failure rate
$this->alertHighPaymentFailureRate($recent_failure_rate);
}
}
}
/**
* Calculate current SLO compliance
*/
public function calculateSLOCompliance(string $slo_name, int $window_days = 30): array
{
$cutoff = now()->subDays($window_days)->timestamp;
switch ($slo_name) {
case 'availability':
return $this->calculateAvailabilitySLO($cutoff);
case 'latency':
return $this->calculateLatencySLO($cutoff);
case 'payment_success':
return $this->calculatePaymentSuccessSLO($cutoff);
default:
throw new \InvalidArgumentException("Unknown SLO: $slo_name");
}
}
/**
* Calculate availability SLO
*/
private function calculateAvailabilitySLO(int $cutoff): array
{
$requests = Redis::zrangebyscore('slo:availability:requests', $cutoff, '+inf');
$total = count($requests);
$successful = 0;
foreach ($requests as $request_json) {
$request = json_decode($request_json, true);
if ($request['success']) {
$successful++;
}
}
$availability = $total > 0 ? $successful / $total : 1.0;
$target = 0.999;
$error_budget_remaining = 1 - ((1 - $availability) / (1 - $target));
return [
'slo_name' => 'availability',
'target' => $target,
'actual' => $availability,
'compliant' => $availability >= $target,
'error_budget_remaining' => max(0, $error_budget_remaining),
'total_requests' => $total,
'successful_requests' => $successful,
];
}
/**
* Calculate latency SLO (P95)
*/
private function calculateLatencySLO(int $cutoff): array
{
$measurements = Redis::zrangebyscore('slo:latency:measurements', $cutoff, '+inf');
if (empty($measurements)) {
return [
'slo_name' => 'latency_p95',
'target' => 0.5,
'actual' => 0,
'compliant' => true,
'sample_count' => 0,
];
}
sort($measurements, SORT_NUMERIC);
$p95_index = (int) (count($measurements) * 0.95);
$p95_latency = $measurements[$p95_index];
$target = 0.5; // 500ms
return [
'slo_name' => 'latency_p95',
'target' => $target,
'actual' => $p95_latency,
'compliant' => $p95_latency <= $target,
'sample_count' => count($measurements),
];
}
/**
* Check if we're burning error budget too quickly
* This detects rapid degradation before we violate SLO
*/
private function checkErrorBudgetBurn(): void
{
// Look at last hour
$one_hour_ago = now()->subHour()->timestamp;
$requests = Redis::zrangebyscore('slo:availability:requests', $one_hour_ago, '+inf');
if (count($requests) < 100) {
return; // Not enough data
}
$failures = 0;
foreach ($requests as $request_json) {
$request = json_decode($request_json, true);
if (!$request['success']) {
$failures++;
}
}
$error_rate = $failures / count($requests);
// If error rate in last hour > 1%, we're burning budget 10x faster than sustainable
if ($error_rate > 0.01) {
Log::critical('Rapid error budget burn detected', [
'error_rate' => $error_rate,
'failures_last_hour' => $failures,
'requests_last_hour' => count($requests),
'burn_rate' => $error_rate / 0.001, // Relative to monthly budget
]);
// Trigger P1 alert
$this->alertErrorBudgetBurn($error_rate);
}
}
/**
* Calculate recent payment failure rate (last 5 minutes)
*/
private function calculateRecentPaymentFailureRate(): float
{
$five_min_ago = now()->subMinutes(5)->timestamp;
$transactions = Redis::zrangebyscore('slo:payments:transactions', $five_min_ago, '+inf');
if (empty($transactions)) {
return 0.0;
}
$failures = 0;
foreach ($transactions as $transaction_json) {
$transaction = json_decode($transaction_json, true);
if (!$transaction['success']) {
$failures++;
}
}
return $failures / count($transactions);
}
/**
* Alert on rapid error budget consumption
*/
private function alertErrorBudgetBurn(float $error_rate): void
{
// Integration with your alerting system
// This would trigger PagerDuty, Slack, etc.
event(new \App\Events\ErrorBudgetBurnAlert([
'error_rate' => $error_rate,
'severity' => 'high',
'message' => sprintf('Error budget burning at %.2f%% error rate over the last hour', $error_rate * 100),
]));
}
/**
* Alert on high payment failure rate
*/
private function alertHighPaymentFailureRate(float $failure_rate): void
{
event(new \App\Events\PaymentFailureRateAlert([
'failure_rate' => $failure_rate,
'severity' => 'critical',
'message' => sprintf('Payment failure rate at %.2f%% over the last 5 minutes', $failure_rate * 100),
]));
}
}
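The tracker only helps if every request actually reaches it. A minimal sketch of wiring it into the HTTP pipeline as middleware — the class name is illustrative; register it globally in bootstrap/app.php (Laravel 11+) or your HTTP kernel. (At real traffic volumes you'd aggregate rather than store every request in Redis, but the wiring pattern is the same.)
<?php
// app/Http/Middleware/TrackSloMetrics.php (illustrative wiring for SLOTracker)
namespace App\Http\Middleware;
use App\Services\SLOTracker;
use Closure;
use Illuminate\Http\Request;
class TrackSloMetrics
{
    public function __construct(private SLOTracker $tracker)
    {
    }
    public function handle(Request $request, Closure $next)
    {
        $start = microtime(true);
        $response = $next($request);
        // Feed status code and wall-clock duration into the availability
        // and latency SLIs maintained by SLOTracker
        $this->tracker->trackRequest(
            (string) $response->getStatusCode(),
            microtime(true) - $start
        );
        return $response;
    }
}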
Disaster Recovery & Business Continuity
Hope is not a strategy. Test your disaster recovery plan before you need it.
Automated Database Backup System
# kubernetes/cronjobs/database-backup.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: database-backup
namespace: ecommerce
spec:
# Run every 6 hours
schedule: "0 */6 * * *"
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
containers:
- name: backup
image: mysql:8.0
env:
- name: DB_HOST
valueFrom:
secretKeyRef:
name: laravel-secrets
key: db-host
- name: DB_USER
valueFrom:
secretKeyRef:
name: laravel-secrets
key: db-user
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: laravel-secrets
key: db-password
- name: DB_NAME
value: "ecommerce"
- name: S3_BUCKET
value: "ecommerce-backups"
- name: AWS_REGION
value: "us-east-1"
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: s3-credentials
key: access-key-id
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: s3-credentials
key: secret-access-key
command:
- /bin/bash
- -c
- |
set -e
BACKUP_DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="backup-${BACKUP_DATE}.sql.gz"
echo "Starting database backup at $(date)"
# Create backup with mysqldump
mysqldump \
--host="$DB_HOST" \
--user="$DB_USER" \
--password="$DB_PASSWORD" \
--single-transaction \
--quick \
--lock-tables=false \
--routines \
--triggers \
--events \
"$DB_NAME" | gzip > "/tmp/$BACKUP_FILE"
# Verify backup file exists and is not empty
if [ ! -s "/tmp/$BACKUP_FILE" ]; then
echo "ERROR: Backup file is empty or does not exist"
exit 1
fi
BACKUP_SIZE=$(du -h "/tmp/$BACKUP_FILE" | cut -f1)
echo "Backup created: $BACKUP_FILE (size: $BACKUP_SIZE)"
# Upload to S3
# NOTE: apt-get assumes a Debian-based image (e.g. mysql:8.0-debian); the
# Oracle Linux-based default mysql:8.0 image does not ship apt-get
apt-get update && apt-get install -y awscli
aws s3 cp "/tmp/$BACKUP_FILE" \
"s3://${S3_BUCKET}/mysql/${BACKUP_FILE}" \
--region "$AWS_REGION"
echo "Backup uploaded to S3: s3://${S3_BUCKET}/mysql/${BACKUP_FILE}"
# Verify upload
aws s3 ls "s3://${S3_BUCKET}/mysql/${BACKUP_FILE}" --region "$AWS_REGION"
# Delete local backup
rm "/tmp/$BACKUP_FILE"
# Delete backups older than 30 days
echo "Cleaning up old backups..."
CUTOFF_DATE=$(date -d '30 days ago' +%Y%m%d)
aws s3 ls "s3://${S3_BUCKET}/mysql/" --region "$AWS_REGION" | \
while read -r line; do
FILE_DATE=$(echo $line | awk '{print $4}' | grep -oP 'backup-\K[0-9]{8}')
if [ ! -z "$FILE_DATE" ] && [ "$FILE_DATE" -lt "$CUTOFF_DATE" ]; then
FILE_NAME=$(echo $line | awk '{print $4}')
echo "Deleting old backup: $FILE_NAME"
aws s3 rm "s3://${S3_BUCKET}/mysql/${FILE_NAME}" --region "$AWS_REGION"
fi
done
echo "Backup completed successfully at $(date)"
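Before trusting the schedule, trigger a run by hand and watch it complete end to end:
kubectl apply -f kubernetes/cronjobs/database-backup.yaml
# Kick off an ad-hoc job from the CronJob template
kubectl create job --from=cronjob/database-backup database-backup-manual -n ecommerce
kubectl logs -f job/database-backup-manual -n ecommerce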
Disaster recovery restoration script:
#!/bin/bash
# scripts/restore-from-backup.sh
#
# Restore database from S3 backup
#
# Usage: ./scripts/restore-from-backup.sh <backup-file-name>
# Example: ./scripts/restore-from-backup.sh backup-20240315-120000.sql.gz
set -e
BACKUP_FILE=$1
if [[ -z "$BACKUP_FILE" ]]; then
echo "Usage: $0 <backup-file-name>"
echo ""
echo "Available backups:"
aws s3 ls s3://ecommerce-backups/mysql/ | grep backup- | awk '{print $4}'
exit 1
fi
echo "=========================================="
echo "DATABASE DISASTER RECOVERY"
echo "=========================================="
echo ""
echo "WARNING: This will REPLACE the current database with backup from:"
echo " $BACKUP_FILE"
echo ""
read -p "Are you absolutely sure? Type 'RESTORE' to continue: " CONFIRM
if [[ "$CONFIRM" != "RESTORE" ]]; then
echo "Restoration cancelled"
exit 1
fi
# Download backup from S3
echo "Downloading backup from S3..."
aws s3 cp "s3://ecommerce-backups/mysql/$BACKUP_FILE" "/tmp/$BACKUP_FILE"
# Verify download
if [ ! -f "/tmp/$BACKUP_FILE" ]; then
echo "ERROR: Failed to download backup file"
exit 1
fi
echo "Backup downloaded: $(du -h /tmp/$BACKUP_FILE | cut -f1)"
# Get database credentials
DB_POD=$(kubectl get pod -n ecommerce -l app=mysql -o jsonpath='{.items[0].metadata.name}')
DB_HOST=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-host}" | base64 -d)
DB_USER=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-user}" | base64 -d)
DB_PASS=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-password}" | base64 -d)
DB_NAME="ecommerce"
# Create backup of current database before restoration
echo "Creating safety backup of current database..."
kubectl exec -n ecommerce $DB_POD -- mysqldump \
-u"$DB_USER" -p"$DB_PASS" "$DB_NAME" | gzip > "/tmp/pre-restore-backup-$(date +%Y%m%d-%H%M%S).sql.gz"
# Copy backup file to database pod
echo "Copying backup to database pod..."
kubectl cp "/tmp/$BACKUP_FILE" "ecommerce/$DB_POD:/tmp/$BACKUP_FILE"
# Restore database
echo "Restoring database..."
kubectl exec -n ecommerce $DB_POD -- bash -c "
set -e
echo 'Dropping existing database...'
mysql -u$DB_USER -p$DB_PASS -e 'DROP DATABASE IF EXISTS $DB_NAME'
echo 'Creating fresh database...'
mysql -u$DB_USER -p$DB_PASS -e 'CREATE DATABASE $DB_NAME CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci'
echo 'Restoring from backup...'
gunzip < /tmp/$BACKUP_FILE | mysql -u$DB_USER -p$DB_PASS $DB_NAME
echo 'Restoration complete'
rm /tmp/$BACKUP_FILE
"
# Verify restoration
echo "Verifying restoration..."
TABLE_COUNT=$(kubectl exec -n ecommerce $DB_POD -- mysql -u"$DB_USER" -p"$DB_PASS" "$DB_NAME" -e "SHOW TABLES" | tail -n +2 | wc -l)
echo "Restoration complete!"
echo " Tables in database: $TABLE_COUNT"
echo ""
echo "Next steps:"
echo "1. Verify application functionality"
echo "2. Run: kubectl rollout restart deployment/laravel-blue -n ecommerce"
echo "3. Monitor error rates in Grafana"
echo ""
echo "Pre-restoration backup saved to: /tmp/pre-restore-backup-*.sql.gz"
Lessons Learned from Production Outages
Real failures teach better than any tutorial. Here are expensive lessons from production incidents.
Lesson 1: Connection Pool Exhaustion (Black Friday 2023)
What happened: At 9 PM on Black Friday, response times spiked to 45 seconds and the error rate hit 23%. Database connections maxed out at 100, but the application needed 300+.
Root cause: Laravel doesn't pool database connections out of the box. Every PHP-FPM worker opens its own connection, so total demand scales with pods × workers per pod, and nothing kept that number below MySQL's max_connections.
The fix:
<?php
// config/database.php - Proper connection management
'mysql' => [
'driver' => 'mysql',
'host' => env('DB_HOST', '127.0.0.1'),
'port' => env('DB_PORT', '3306'),
'database' => env('DB_DATABASE', 'forge'),
'username' => env('DB_USERNAME', 'forge'),
'password' => env('DB_PASSWORD', ''),
'unix_socket' => env('DB_SOCKET', ''),
'charset' => 'utf8mb4',
'collation' => 'utf8mb4_unicode_ci',
'prefix' => '',
'prefix_indexes' => true,
'strict' => true,
'engine' => null,
'options' => extension_loaded('pdo_mysql') ? array_filter([
// Critical: Set connection timeout
PDO::ATTR_TIMEOUT => 3,
// Critical: Enable persistent connections to reuse TCP connections
PDO::ATTR_PERSISTENT => env('DB_PERSISTENT', false),
// Set reasonable wait_timeout on MySQL side
PDO::MYSQL_ATTR_INIT_COMMAND => "SET SESSION wait_timeout=600",
]) : [],
// Pool configuration (note: only honored by long-running, connection-pooling
// runtimes such as Swoole/Octane-style setups; plain PHP-FPM ignores this key)
'pool' => [
'min_connections' => env('DB_POOL_MIN', 2),
'max_connections' => env('DB_POOL_MAX', 10),
'connect_timeout' => 3.0,
'wait_timeout' => 600,
'idle_timeout' => 60,
'max_lifetime' => 3600,
],
],
Monitor connection usage:
<?php
// app/Console/Commands/MonitorDatabaseConnections.php
namespace App\Console\Commands;
use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;
class MonitorDatabaseConnections extends Command
{
protected $signature = 'db:monitor-connections';
protected $description = 'Monitor database connection usage';
public function handle()
{
while (true) {
$connections = DB::select("SHOW PROCESSLIST");
$active = count(array_filter($connections, fn($c) => $c->Command !== 'Sleep'));
$total = count($connections);
$this->info(sprintf(
'[%s] Connections: %d active / %d total',
now()->toDateTimeString(),
$active,
$total
));
// Alert if nearing limit
if ($total > 80) {
$this->error("WARNING: High connection count!");
}
sleep(5);
}
}
}
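It also helps to sanity-check the arithmetic on the MySQL side: with, say, 10 web pods running 25 PHP-FPM workers each, worst-case demand is 250 connections, so max_connections needs comfortable headroom above that. A quick check (the deployment name and root-credential handling here are assumptions):
# Compare configured capacity with the high-water mark MySQL has actually seen
kubectl exec -n ecommerce deploy/mysql -- mysql -uroot -p"$MYSQL_ROOT_PASSWORD" \
  -e "SHOW VARIABLES LIKE 'max_connections'; SHOW STATUS LIKE 'Max_used_connections';"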
Lesson 2: Cache Stampede During Deployment
What happened: After deploying new code, cache was cleared. 10,000+ requests simultaneously tried to rebuild the same cached product catalog. Database CPU hit 100%, response times went to 30+ seconds.
The fix - Cache warming strategy:
<?php
// app/Console/Commands/WarmCache.php
namespace App\Console\Commands;
use Illuminate\Console\Command;
use Illuminate\Support\Facades\Cache;
use App\Models\Product;
class WarmCache extends Command
{
protected $signature = 'cache:warm';
protected $description = 'Pre-warm critical caches before deployment';
public function handle()
{
$this->info('Warming critical caches...');
// Warm product catalog cache
$this->warmProductCatalog();
// Warm category tree
$this->warmCategoryTree();
// Warm popular searches
$this->warmPopularSearches();
$this->info('Cache warming complete!');
}
private function warmProductCatalog(): void
{
$this->info(' Warming product catalog...');
Cache::remember('products:featured', 3600, function() {
return Product::where('featured', true)
->with(['images', 'categories'])
->get();
});
Cache::remember('products:bestsellers', 3600, function() {
return Product::orderBy('sales_count', 'desc')
->limit(50)
->with(['images', 'categories'])
->get();
});
}
private function warmCategoryTree(): void
{
$this->info(' Warming category tree...');
Cache::remember('categories:tree', 3600, function() {
return \App\Models\Category::with('children')->whereNull('parent_id')->get();
});
}
private function warmPopularSearches(): void
{
$this->info(' Warming popular search queries...');
$popular_queries = ['laptop', 'phone', 'headphones', 'camera'];
foreach ($popular_queries as $query) {
Cache::remember("search:$query", 1800, function() use ($query) {
return Product::search($query)->take(20)->get();
});
}
}
}
Run cache warming BEFORE traffic switches:
# In your deployment script, before switching blue-green
kubectl exec -n ecommerce deployment/laravel-green -- php artisan cache:warm
# Then switch traffic
kubectl patch service laravel-service -n ecommerce \
-p '{"spec":{"selector":{"environment":"green"}}}'
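Warming covers deployments, but cache expiry under heavy load can still stampede. Laravel's atomic cache locks are one way to ensure only a single request rebuilds an expired entry — a sketch, with illustrative key names and timeouts (requires a lock-capable store; the Redis store used in this series qualifies):
<?php
// Illustrative only: runtime stampede protection with an atomic cache lock
use Illuminate\Support\Facades\Cache;
use App\Models\Product;
function featuredProducts()
{
    // Fast path: serve the cached value if it is still present
    $cached = Cache::get('products:featured');
    if ($cached !== null) {
        return $cached;
    }
    // Slow path: one request acquires the lock and rebuilds the entry; the
    // rest wait up to 5 seconds, then read the freshly cached value.
    // block() throws LockTimeoutException if the lock never frees up.
    return Cache::lock('lock:products:featured', 10)->block(5, function () {
        return Cache::remember('products:featured', 3600, function () {
            return Product::where('featured', true)
                ->with(['images', 'categories'])
                ->get();
        });
    });
}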
Lesson 3: Silent Data Corruption from Race Condition
What happened: A customer reported being double-charged. Investigation found a race condition in payment processing: two requests for the same order ID both succeeded because the database check happened before the insert.
The fix - Idempotency keys and database constraints:
<?php
// database/migrations/2024_03_15_add_idempotency_to_payments.php
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;
use Illuminate\Support\Facades\DB;
return new class extends Migration
{
public function up()
{
Schema::table('payments', function (Blueprint $table) {
// Add idempotency key column
$table->string('idempotency_key', 64)->nullable()->after('id');
// Create unique index to prevent duplicate processing
$table->unique('idempotency_key', 'payments_idempotency_key_unique');
// Add index for fast lookups
$table->index(['order_id', 'status'], 'payments_order_status_index');
});
// Ensure each order has at most one successful payment. MySQL does not
// support partial indexes (the PostgreSQL-style WHERE clause), so use a
// stored generated column that mirrors order_id only for successful
// payments and give it a unique index (NULLs may repeat). Adjust the
// column type to match payments.order_id.
DB::statement("
    ALTER TABLE payments
    ADD COLUMN successful_order_id BIGINT UNSIGNED
        GENERATED ALWAYS AS (IF(status = 'success', order_id, NULL)) STORED,
    ADD UNIQUE INDEX payments_order_success_unique (successful_order_id)
");
}
public function down()
{
DB::statement('ALTER TABLE payments DROP INDEX payments_order_success_unique, DROP COLUMN successful_order_id');
Schema::table('payments', function (Blueprint $table) {
$table->dropUnique('payments_idempotency_key_unique');
$table->dropIndex('payments_order_status_index');
$table->dropColumn('idempotency_key');
});
}
};
<?php
// app/Services/PaymentService.php - Idempotent payment processing
namespace App\Services;
use App\Models\Payment;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Str;
class PaymentService
{
public function processPayment(string $order_id, float $amount, ?string $idempotency_key = null): Payment
{
// Generate idempotency key if not provided
$idempotency_key = $idempotency_key ?? Str::uuid()->toString();
// Check if this payment was already processed
$existing = Payment::where('idempotency_key', $idempotency_key)->first();
if ($existing) {
\Log::info('Payment already processed, returning existing', [
'idempotency_key' => $idempotency_key,
'payment_id' => $existing->id,
]);
return $existing;
}
// Use database transaction with explicit locking
return DB::transaction(function() use ($order_id, $amount, $idempotency_key) {
// Check if order already has successful payment (with row lock)
$existing_success = Payment::where('order_id', $order_id)
->where('status', 'success')
->lockForUpdate() // Critical: Lock row to prevent race condition
->first();
if ($existing_success) {
throw new \Exception("Order {$order_id} already has successful payment");
}
// Create payment record FIRST (before calling Stripe)
// This ensures we have database constraint protection
try {
$payment = Payment::create([
'idempotency_key' => $idempotency_key,
'order_id' => $order_id,
'amount' => $amount,
'status' => 'pending',
]);
} catch (\Illuminate\Database\QueryException $e) {
// Duplicate idempotency key - payment already processing
if ($e->getCode() === '23000') { // Integrity constraint violation
\Log::warning('Duplicate payment attempt blocked', [
'order_id' => $order_id,
'idempotency_key' => $idempotency_key,
]);
// Return existing payment
return Payment::where('idempotency_key', $idempotency_key)->first();
}
throw $e;
}
// Now process with Stripe
try {
$stripe_payment = \Stripe\PaymentIntent::create([
'amount' => (int) round($amount * 100), // Stripe expects an integer amount in the smallest currency unit
'currency' => 'usd',
'metadata' => [
'order_id' => $order_id,
'payment_id' => $payment->id,
],
], [
'idempotency_key' => $idempotency_key, // Stripe also supports idempotency
]);
$payment->update([
'stripe_payment_intent_id' => $stripe_payment->id,
'status' => 'success',
'completed_at' => now(),
]);
return $payment;
} catch (\Exception $e) {
$payment->update([
'status' => 'failed',
'error_message' => $e->getMessage(),
]);
throw $e;
}
}, 5); // 5 retry attempts for deadlocks
}
}
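One way to wire this up end to end — a sketch; the route, validation rules, and the Idempotency-Key header convention are assumptions. Clients send the same header value on retries, so a double-submitted checkout resolves to the same Payment row:
<?php
// app/Http/Controllers/PaymentController.php (illustrative wiring only)
namespace App\Http\Controllers;
use App\Services\PaymentService;
use Illuminate\Http\JsonResponse;
use Illuminate\Http\Request;
class PaymentController extends Controller
{
    public function __construct(private PaymentService $payments)
    {
    }
    public function store(Request $request): JsonResponse
    {
        $validated = $request->validate([
            'order_id' => 'required|string',
            'amount'   => 'required|numeric|min:0.5',
        ]);
        // Reuse the client-supplied key if present; PaymentService generates
        // one otherwise, so the request still succeeds without the header
        $payment = $this->payments->processPayment(
            $validated['order_id'],
            (float) $validated['amount'],
            $request->header('Idempotency-Key')
        );
        return response()->json($payment, 201);
    }
}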
Beyond This Series: Advanced Topics
We've built a production-grade e-commerce platform. Here's what comes next as you scale.
1. Multi-Region Deployment
When your platform grows globally, a single-region deployment isn't enough:
- Latency: US users experience 200ms+ latency to EU-hosted services
- Compliance: GDPR requires EU data stays in EU
- Availability: Regional AWS outages happen (us-east-1 outage of December 2021)
Next steps:
- Set up Kubernetes clusters in multiple regions (us-east-1, eu-west-1, ap-southeast-1)
- Implement global load balancing with Route53 or CloudFlare
- Use MySQL read replicas or multi-region databases like Amazon Aurora Global
- Implement distributed caching with Redis Cluster across regions
2. Advanced Observability
The production monitoring we've covered is foundational. Advanced observability includes:
- Distributed tracing with context propagation across microservices
- Real User Monitoring (RUM) to measure actual user experience
- Synthetic monitoring to catch issues before users do
- Cost attribution per feature/customer using Kubecost
Tools to explore:
- OpenTelemetry for standardized observability
- Honeycomb.io for high-cardinality observability
- Lightstep for service mesh observability
3. Advanced Security Hardening
Security is never finished:
- Runtime security with Falco: Detect unexpected behavior in containers
- Image scanning: Integrate Trivy or Snyk into CI/CD
- Network policies: Restrict pod-to-pod communication (a sketch follows this list)
- Secrets management: Migrate from Kubernetes Secrets to HashiCorp Vault
- mTLS between services: Use service mesh like Istio or Linkerd
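A concrete example of the network-policy item above — a minimal sketch that only admits traffic from the ingress controller and only allows egress to MySQL, Redis, DNS, and HTTPS. Every name, namespace, and port here is an assumption to adapt to your cluster:
# kubernetes/security/networkpolicy-laravel.yaml (illustrative)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: laravel-restrict-traffic
  namespace: ecommerce
spec:
  podSelector:
    matchLabels:
      app: laravel
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only the ingress controller namespace may reach the application pods
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - port: 8080
  egress:
    # Application pods may talk to MySQL and Redis...
    - to:
        - podSelector:
            matchLabels:
              app: mysql
      ports:
        - port: 3306
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - port: 6379
    # ...to DNS, which restricted egress would otherwise block...
    - to:
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # ...and over HTTPS to Stripe and other external APIs
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - port: 443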
4. Cost Optimization at Scale
At scale, cost optimization becomes critical:
- Reserved instances / Savings Plans: 40-60% savings for predictable workloads
- Spot instances for stateless workloads: 70-90% savings for batch jobs
- Database query optimization: Poorly optimized queries cost thousands monthly
- CDN usage: Offload static assets to reduce compute costs
- Right-sizing based on actual usage patterns
5. Chaos Engineering
Test your disaster recovery before disaster happens:
- Pod deletion: Random pod failures using Chaos Mesh
- Network latency injection: Simulate slow connections
- Resource exhaustion: Test behavior under memory/CPU pressure
- Dependency failures: Inject failures in Redis, database, external APIs
- Game Days: Scheduled failure testing with full team participation
Start here:
# Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
# Create a simple chaos experiment
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-failure-test
namespace: ecommerce
spec:
action: pod-failure
mode: one
duration: "60s"
selector:
namespaces:
- ecommerce
labelSelectors:
app: laravel
EOF
Final Thoughts
We've covered eight parts and built a production-grade e-commerce platform:
Part 1: Domain-Driven Design foundation
Part 2: Stripe payment integration with webhooks
Part 3: Event-driven architecture with RabbitMQ
Part 4: Background job processing and queues
Part 5: Docker containerization
Part 6: Kubernetes orchestration
Part 7: CI/CD pipeline automation
Part 8: Production deployment and operations
What makes this production-grade:
- ✅ Comprehensive health checks and observability
- ✅ Blue-green deployment for zero-downtime releases
- ✅ Automated backups with tested restore procedures
- ✅ Incident response runbooks for 2 AM outages
- ✅ SLO tracking and error budget monitoring
- ✅ Cost optimization and autoscaling
- ✅ Lessons learned from real production failures
Your platform is now production-ready, but remember: production readiness is not a destination. It's continuous improvement based on real operational data.
Keep learning:
- Monitor your SLOs religiously
- Conduct post-mortems after every incident
- Test disaster recovery quarterly
- Review costs monthly
- Update runbooks as systems evolve
The code is at https://github.com/iBekzod/laravel-ecommerce — production-tested patterns you can use today.
Questions or war stories from your production deployments? I'd love to hear them at https://nextgenbeing.com.
Now go build something that scales. 🚀
Daniel Hartwell
Author: Senior backend engineer focused on distributed systems and database performance. Previously at fintech and SaaS scale-ups. Writes about the boring-but-critical infrastructure that keeps systems running.