Building a Production-Grade E-Commerce Platform with Laravel 12, Stripe, and Kubernetes - Part 8: Production Deployment & Monitoring - NextGenBeing
Part 8 of 8

Building a Production-Grade E-Commerce Platform with Laravel 12, Stripe, and Kubernetes - Part 8: Production Deployment & Monitoring


Comprehensive Tutorials

Daniel Hartwell

May 12, 2026


Estimated reading time: 35 minutes

Table of Contents

  1. Introduction: The Final Mile
  2. Pre-Deployment Production Readiness Checklist
  3. Blue-Green Deployment Strategy
  4. Production Observability Stack
  5. Incident Response Runbook
  6. Cost Optimization in Production
  7. Performance Monitoring & SLO Tracking
  8. Disaster Recovery & Business Continuity
  9. Lessons Learned from Production Outages
  10. Beyond This Series: Advanced Topics

Introduction: The Final Mile

We've built an e-commerce platform over seven parts—from domain-driven design through payment processing, event-driven architecture, and Kubernetes orchestration. Now comes the most critical phase: production deployment and operational excellence.

The reality: most production failures surface after initial deployment. Not from bugs in the code, but from operational blind spots. I've participated in three major e-commerce platform launches, and each taught expensive lessons: a database connection pool exhausted at 3 AM during Black Friday, a memory leak that only manifested after 72 hours of uptime, a backup system that "worked" in testing but failed during an actual disaster.

This final part covers what separates projects that survive first contact with production traffic from those that don't. We'll implement real monitoring, establish incident response procedures, and build the operational foundation your platform needs to scale from launch day through the first million orders.

What we're deploying:

┌─────────────────────────────────────────────────────────────┐
│                     Production Environment                   │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐    ┌──────────────┐   ┌──────────────┐  │
│  │   Blue       │    │   Grafana    │   │   Loki       │  │
│  │ Environment  │◄───│  Monitoring  │◄──│   Logs       │  │
│  │  (Current)   │    │   Stack      │   │              │  │
│  └──────────────┘    └──────────────┘   └──────────────┘  │
│         │                    │                   │          │
│         │                    ▼                   │          │
│  ┌──────────────┐    ┌──────────────┐   ┌──────────────┐  │
│  │   Green      │    │  Prometheus  │   │  Jaeger      │  │
│  │ Environment  │    │   Metrics    │   │  Tracing     │  │
│  │   (Staged)   │    │              │   │              │  │
│  └──────────────┘    └──────────────┘   └──────────────┘  │
│         │                                                    │
│         ▼                                                    │
│  ┌─────────────────────────────────────────────────────┐  │
│  │         Automated Backup & DR System                 │  │
│  │  • Database snapshots every 6 hours                  │  │
│  │  • Cross-region replication                          │  │
│  │  • Point-in-time recovery capability                 │  │
│  └─────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Critical dependencies:

# Monitoring & Observability
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts

# Update your local chart index
helm repo update

Pre-Deployment Production Readiness Checklist

Before touching production, verify every system is instrumented and validated. This checklist prevented three near-disasters in my last deployment.

1. Health Check Endpoints

Why this matters: Kubernetes uses these to determine pod health. Get them wrong, and K8s will kill healthy pods or keep broken ones running.

<?php
// app/Http/Controllers/HealthController.php

namespace App\Http\Controllers;

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Redis;
use Illuminate\Support\Facades\Cache;
use Illuminate\Http\JsonResponse;
use Illuminate\Http\Response;

class HealthController extends Controller
{
    /**
     * Liveness probe - "Is the application running?"
     * Should return 200 even if dependencies are down.
     * Used by Kubernetes to decide if container needs restart.
     * 
     * Failure threshold: 3 consecutive failures = pod restart
     */
    public function liveness(): JsonResponse
    {
        // Only check if PHP-FPM is responsive
        // Do NOT check database, Redis, or external dependencies
        return response()->json([
            'status' => 'alive',
            'timestamp' => now()->toIso8601String(),
            'uptime_seconds' => (int) shell_exec('cut -d. -f1 /proc/uptime'),
            'memory_usage_mb' => round(memory_get_usage(true) / 1024 / 1024, 2),
        ]);
    }

    /**
     * Readiness probe - "Can the application handle traffic?"
     * Should verify all critical dependencies.
     * Used by Kubernetes to route traffic to this pod.
     * 
     * Failure threshold: 3 consecutive failures = removed from load balancer
     */
    public function readiness(): JsonResponse
    {
        $checks = [];
        $overall_healthy = true;
        $start_time = microtime(true);

        // Database connectivity check
        try {
            DB::connection()->getPdo();
            $checks['database'] = [
                'healthy' => true,
                'latency_ms' => round((microtime(true) - $start_time) * 1000, 2),
            ];
        } catch (\Exception $e) {
            $checks['database'] = [
                'healthy' => false,
                'error' => $e->getMessage(),
            ];
            $overall_healthy = false;
        }

        // Redis connectivity check
        $redis_start = microtime(true);
        try {
            Redis::ping();
            $checks['redis'] = [
                'healthy' => true,
                'latency_ms' => round((microtime(true) - $redis_start) * 1000, 2),
            ];
        } catch (\Exception $e) {
            $checks['redis'] = [
                'healthy' => false,
                'error' => $e->getMessage(),
            ];
            $overall_healthy = false;
        }

        // Storage connectivity check (S3 or equivalent)
        $storage_start = microtime(true);
        try {
            \Storage::disk('s3')->exists('.health-check');
            $checks['storage'] = [
                'healthy' => true,
                'latency_ms' => round((microtime(true) - $storage_start) * 1000, 2),
            ];
        } catch (\Exception $e) {
            $checks['storage'] = [
                'healthy' => false,
                'error' => $e->getMessage(),
            ];
            // Storage issues shouldn't prevent traffic routing
            // Just log for monitoring
            \Log::warning('Storage health check failed', [
                'error' => $e->getMessage(),
            ]);
        }

        // Queue connectivity check
        $queue_start = microtime(true);
        try {
            // Attempt to get queue size without processing jobs
            $queue_size = Redis::llen('queues:default');
            $checks['queue'] = [
                'healthy' => true,
                'pending_jobs' => $queue_size,
                'latency_ms' => round((microtime(true) - $queue_start) * 1000, 2),
            ];
        } catch (\Exception $e) {
            $checks['queue'] = [
                'healthy' => false,
                'error' => $e->getMessage(),
            ];
            $overall_healthy = false;
        }

        $status_code = $overall_healthy ? Response::HTTP_OK : Response::HTTP_SERVICE_UNAVAILABLE;

        return response()->json([
            'status' => $overall_healthy ? 'ready' : 'not_ready',
            'checks' => $checks,
            'timestamp' => now()->toIso8601String(),
            'total_check_time_ms' => round((microtime(true) - $start_time) * 1000, 2),
        ], $status_code);
    }

    /**
     * Startup probe - "Has the application finished initializing?"
     * Critical for slow-starting applications (large caches, migrations, etc).
     * 
     * Failure threshold: 30 failures (at 10s intervals) = 5 minutes before restart
     */
    public function startup(): JsonResponse
    {
        // Check if application has completed critical initialization
        $initialized = true;
        $initialization_checks = [];

        // Verify config cache exists (production requirement)
        if (!file_exists(base_path('bootstrap/cache/config.php'))) {
            $initialized = false;
            $initialization_checks['config_cache'] = false;
        } else {
            $initialization_checks['config_cache'] = true;
        }

        // Verify route cache exists (production requirement)
        if (!file_exists(base_path('bootstrap/cache/routes-v7.php'))) {
            $initialized = false;
            $initialization_checks['route_cache'] = false;
        } else {
            $initialization_checks['route_cache'] = true;
        }

        // Verify the migrations table is reachable and at least one migration has run.
        // (A full pending-migration check would require shelling out to `artisan migrate:status`.)
        try {
            $ran_migrations = DB::table('migrations')->count();
            $initialization_checks['migrations'] = $ran_migrations > 0;
            if ($ran_migrations === 0) {
                $initialized = false;
            }
        } catch (\Exception $e) {
            $initialized = false;
            $initialization_checks['migrations'] = false;
        }

        $status_code = $initialized ? Response::HTTP_OK : Response::HTTP_SERVICE_UNAVAILABLE;

        return response()->json([
            'status' => $initialized ? 'initialized' : 'initializing',
            'checks' => $initialization_checks,
            'timestamp' => now()->toIso8601String(),
        ], $status_code);
    }

    /**
     * Detailed metrics endpoint for Prometheus scraping
     * This endpoint should NOT be exposed publicly
     */
    public function metrics(): Response
    {
        $metrics = [];

        // Application metrics
        $metrics[] = sprintf('# HELP app_uptime_seconds Application uptime in seconds');
        $metrics[] = sprintf('# TYPE app_uptime_seconds gauge');
        $metrics[] = sprintf('app_uptime_seconds %d', (int) shell_exec('cut -d. -f1 /proc/uptime'));

        // Memory metrics
        $metrics[] = sprintf('# HELP app_memory_usage_bytes Current memory usage in bytes');
        $metrics[] = sprintf('# TYPE app_memory_usage_bytes gauge');
        $metrics[] = sprintf('app_memory_usage_bytes %d', memory_get_usage(true));

        // Database connection pool metrics
        $metrics[] = sprintf('# HELP db_connections_active Active database connections');
        $metrics[] = sprintf('# TYPE db_connections_active gauge');
        try {
            $connections = DB::select("SHOW STATUS LIKE 'Threads_connected'");
            $metrics[] = sprintf('db_connections_active %d', $connections[0]->Value ?? 0);
        } catch (\Exception $e) {
            $metrics[] = sprintf('db_connections_active 0');
        }

        // Queue metrics
        $metrics[] = sprintf('# HELP queue_jobs_pending Jobs pending in queue');
        $metrics[] = sprintf('# TYPE queue_jobs_pending gauge');
        try {
            $pending = Redis::llen('queues:default');
            $metrics[] = sprintf('queue_jobs_pending %d', $pending);
        } catch (\Exception $e) {
            $metrics[] = sprintf('queue_jobs_pending 0');
        }

        // Cache hit rate (requires custom tracking)
        $metrics[] = sprintf('# HELP cache_hits_total Total cache hits');
        $metrics[] = sprintf('# TYPE cache_hits_total counter');
        $cache_hits = Cache::get('metrics:cache:hits', 0);
        $metrics[] = sprintf('cache_hits_total %d', $cache_hits);

        return response(implode("\n", $metrics), 200)
            ->header('Content-Type', 'text/plain; version=0.0.4');
    }
}
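The metrics() method above hand-builds the Prometheus text exposition format, in which every sample line must be preceded by its # HELP and # TYPE comments. A throwaway shell sketch of that shape (the metric names and values here are just examples), useful for eyeballing what a scraper will see:

```shell
#!/usr/bin/env bash
# Emit one gauge in Prometheus text exposition format (version 0.0.4):
# the HELP and TYPE comment lines come before the sample line itself.
prom_gauge() {
    local name=$1 help=$2 value=$3
    printf '# HELP %s %s\n' "$name" "$help"
    printf '# TYPE %s gauge\n' "$name"
    printf '%s %s\n' "$name" "$value"
}

prom_gauge app_memory_usage_bytes "Current memory usage in bytes" 134217728
prom_gauge queue_jobs_pending "Jobs pending in queue" 0
```

Piping the real endpoint's output through `promtool check metrics` catches format mistakes before Prometheus silently drops samples.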

Register these routes:

<?php
// routes/web.php

// Health check endpoints - NO authentication, NO rate limiting
// These must respond quickly and reliably
Route::get('/health/live', [App\Http\Controllers\HealthController::class, 'liveness'])
    ->name('health.liveness');

Route::get('/health/ready', [App\Http\Controllers\HealthController::class, 'readiness'])
    ->name('health.readiness');

Route::get('/health/startup', [App\Http\Controllers\HealthController::class, 'startup'])
    ->name('health.startup');

// Metrics endpoint - restrict to internal network only
Route::get('/metrics', [App\Http\Controllers\HealthController::class, 'metrics'])
    ->middleware(['throttle:120,1']) // Allow high frequency scraping
    ->name('metrics');

Configure Kubernetes probes:

# kubernetes/deployments/laravel-web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: laravel-web
  namespace: ecommerce
spec:
  replicas: 3
  selector:
    matchLabels:
      app: laravel-web
  template:
    metadata:
      labels:
        app: laravel-web
        version: v1.0.0
    spec:
      containers:
      - name: laravel
        image: ghcr.io/ibekzod/laravel-ecommerce:latest
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        
        # Startup probe - gives app up to 5 minutes to initialize
        # Critical for first deployment or after cache clearing
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
            httpHeaders:
            - name: X-Probe-Type
              value: startup
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 30  # 30 failures * 10s = 5 minutes max startup time
        
        # Liveness probe - restarts container if unhealthy
        # Should almost never fail in production
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
            httpHeaders:
            - name: X-Probe-Type
              value: liveness
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3  # 3 failures = restart
        
        # Readiness probe - removes from load balancer if unhealthy
        # Will fail during deployment or dependency issues
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
            httpHeaders:
            - name: X-Probe-Type
              value: readiness
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3  # 3 failures = remove from service
        
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        
        env:
        - name: APP_ENV
          value: "production"
        - name: APP_DEBUG
          value: "false"
        - name: LOG_CHANNEL
          value: "stack"
        - name: DB_CONNECTION
          value: "mysql"
        - name: CACHE_DRIVER
          value: "redis"
        - name: QUEUE_CONNECTION
          value: "redis"
        
        # Database credentials from secret
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: laravel-secrets
              key: db-host
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: laravel-secrets
              key: db-password
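A sanity check worth repeating whenever these values change: a probe's failure window is failureThreshold multiplied by periodSeconds. The settings above give the app up to 5 minutes to start, but only 30 seconds of failed liveness probes before a restart, and 15 seconds of failed readiness probes before removal from the Service. The arithmetic as a throwaway sketch:

```shell
#!/usr/bin/env bash
# Failure window in seconds = failureThreshold * periodSeconds
probe_window() { echo $(( $1 * $2 )); }

probe_window 30 10   # startup probe:   300s before the pod is restarted
probe_window 3 10    # liveness probe:  30s of failures before restart
probe_window 3 5     # readiness probe: 15s before removal from the Service
```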

2. Production Configuration Validation Script

Why this matters: Deploying with the wrong configuration is one of the most common causes of production incidents. This script catches misconfigurations before deployment.

#!/bin/bash
# scripts/validate-production-config.sh
#
# Run this script before EVERY production deployment
# It validates configuration, secrets, and infrastructure readiness
#
# Usage: ./scripts/validate-production-config.sh

set -uo pipefail  # Fail on unset variables; individual checks accumulate into VALIDATION_FAILED instead of aborting

echo "=================================================="
echo "Production Configuration Validation"
echo "=================================================="
echo ""

# Color codes for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

VALIDATION_FAILED=0

# Function to print validation results
check_passed() {
    echo -e "${GREEN}✓${NC} $1"
}

check_failed() {
    echo -e "${RED}✗${NC} $1"
    VALIDATION_FAILED=1
}

check_warning() {
    echo -e "${YELLOW}⚠${NC} $1"
}

# 1. Verify kubectl is configured for correct cluster
echo "1. Verifying Kubernetes cluster context..."
CURRENT_CONTEXT=$(kubectl config current-context)
EXPECTED_CONTEXT="production-cluster"

if [[ "$CURRENT_CONTEXT" == "$EXPECTED_CONTEXT" ]]; then
    check_passed "Connected to production cluster: $CURRENT_CONTEXT"
else
    check_failed "Wrong cluster context. Expected: $EXPECTED_CONTEXT, Got: $CURRENT_CONTEXT"
    echo "   Switch context with: kubectl config use-context $EXPECTED_CONTEXT"
fi

# 2. Verify namespace exists
echo ""
echo "2. Verifying namespace..."
if kubectl get namespace ecommerce &> /dev/null; then
    check_passed "Namespace 'ecommerce' exists"
else
    check_failed "Namespace 'ecommerce' does not exist"
    echo "   Create with: kubectl create namespace ecommerce"
fi

# 3. Verify all required secrets exist
echo ""
echo "3. Verifying Kubernetes secrets..."
REQUIRED_SECRETS=(
    "laravel-secrets"
    "stripe-api-keys"
    "database-credentials"
    "redis-password"
    "s3-credentials"
)

for secret in "${REQUIRED_SECRETS[@]}"; do
    if kubectl get secret "$secret" -n ecommerce &> /dev/null; then
        check_passed "Secret '$secret' exists"
        
        # Verify secret has required keys
        case $secret in
            "laravel-secrets")
                REQUIRED_KEYS=("app-key" "db-host" "db-user" "db-password")
                ;;
            "stripe-api-keys")
                REQUIRED_KEYS=("secret-key" "webhook-secret")
                ;;
            *)
                REQUIRED_KEYS=()
                ;;
        esac
        
        for key in "${REQUIRED_KEYS[@]}"; do
            # jsonpath exits 0 with empty output when a key is missing,
            # so test the output itself rather than the exit code
            KEY_VALUE=$(kubectl get secret "$secret" -n ecommerce -o jsonpath="{.data.$key}" 2>/dev/null)
            if [[ -n "$KEY_VALUE" ]]; then
                check_passed "  Key '$key' present in secret '$secret'"
            else
                check_failed "  Key '$key' missing from secret '$secret'"
            fi
        done
    else
        check_failed "Secret '$secret' not found"
    fi
done

# 4. Verify ConfigMaps
echo ""
echo "4. Verifying ConfigMaps..."
REQUIRED_CONFIGMAPS=(
    "laravel-config"
    "nginx-config"
)

for cm in "${REQUIRED_CONFIGMAPS[@]}"; do
    if kubectl get configmap "$cm" -n ecommerce &> /dev/null; then
        check_passed "ConfigMap '$cm' exists"
    else
        check_failed "ConfigMap '$cm' not found"
    fi
done

# 5. Verify database connectivity
echo ""
echo "5. Verifying database connectivity..."
DB_HOST=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-host}" | base64 -d)
DB_USER=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-user}" | base64 -d)
DB_PASS=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-password}" | base64 -d)

# Test database connection using a temporary pod
cat <<EOF | kubectl apply -f - &> /dev/null
apiVersion: v1
kind: Pod
metadata:
  name: db-test
  namespace: ecommerce
spec:
  containers:
  - name: mysql-client
    image: mysql:8.0
    command: ['sleep', '30']
  restartPolicy: Never
EOF

kubectl wait --for=condition=Ready pod/db-test -n ecommerce --timeout=60s &> /dev/null || true  # Wait for pod readiness instead of a fixed sleep

if kubectl exec -n ecommerce db-test -- mysql -h"$DB_HOST" -u"$DB_USER" -p"$DB_PASS" -e "SELECT 1" &> /dev/null; then
    check_passed "Database connection successful"
else
    check_failed "Cannot connect to database at $DB_HOST"
fi

kubectl delete pod db-test -n ecommerce &> /dev/null

# 6. Verify Redis connectivity
echo ""
echo "6. Verifying Redis connectivity..."
REDIS_HOST=$(kubectl get configmap laravel-config -n ecommerce -o jsonpath="{.data.REDIS_HOST}")
REDIS_PASSWORD=$(kubectl get secret redis-password -n ecommerce -o jsonpath="{.data.password}" | base64 -d)

cat <<EOF | kubectl apply -f - &> /dev/null
apiVersion: v1
kind: Pod
metadata:
  name: redis-test
  namespace: ecommerce
spec:
  containers:
  - name: redis-client
    image: redis:7-alpine
    command: ['sleep', '30']
  restartPolicy: Never
EOF

kubectl wait --for=condition=Ready pod/redis-test -n ecommerce --timeout=60s &> /dev/null || true

if kubectl exec -n ecommerce redis-test -- redis-cli -h "$REDIS_HOST" -a "$REDIS_PASSWORD" PING | grep -q "PONG"; then
    check_passed "Redis connection successful"
else
    check_failed "Cannot connect to Redis at $REDIS_HOST"
fi

kubectl delete pod redis-test -n ecommerce &> /dev/null

# 7. Verify persistent volumes
echo ""
echo "7. Verifying persistent volumes..."
REQUIRED_PVCS=(
    "mysql-pvc"
    "redis-pvc"
)

for pvc in "${REQUIRED_PVCS[@]}"; do
    PVC_STATUS=$(kubectl get pvc "$pvc" -n ecommerce -o jsonpath="{.status.phase}" 2>/dev/null)
    if [[ "$PVC_STATUS" == "Bound" ]]; then
        check_passed "PVC '$pvc' is Bound"
    else
        check_failed "PVC '$pvc' is not Bound (status: $PVC_STATUS)"
    fi
done

# 8. Verify container images exist and are accessible
echo ""
echo "8. Verifying container images..."
REQUIRED_IMAGES=(
    "ghcr.io/ibekzod/laravel-ecommerce:latest"
    "ghcr.io/ibekzod/laravel-ecommerce-worker:latest"
)

for image in "${REQUIRED_IMAGES[@]}"; do
    # Try to pull image manifest (without downloading layers)
    if docker manifest inspect "$image" &> /dev/null; then
        check_passed "Image '$image' is accessible"
    else
        check_failed "Image '$image' cannot be accessed"
        echo "   Verify image exists and credentials are correct"
    fi
done

# 9. Verify monitoring stack is running
echo ""
echo "9. Verifying monitoring stack..."
MONITORING_SERVICES=(
    "prometheus-server"
    "grafana"
    "loki"
)

for service in "${MONITORING_SERVICES[@]}"; do
    if kubectl get deployment "$service" -n monitoring &> /dev/null; then
        READY=$(kubectl get deployment "$service" -n monitoring -o jsonpath="{.status.readyReplicas}")
        READY=${READY:-0}  # readyReplicas is absent from the status when no pods are ready
        DESIRED=$(kubectl get deployment "$service" -n monitoring -o jsonpath="{.spec.replicas}")
        
        if [[ "$READY" == "$DESIRED" ]]; then
            check_passed "$service is running ($READY/$DESIRED replicas ready)"
        else
            check_warning "$service has $READY/$DESIRED replicas ready"
        fi
    else
        check_failed "$service deployment not found"
    fi
done

# 10. Verify backup system is configured
echo ""
echo "10. Verifying backup configuration..."
if kubectl get cronjob database-backup -n ecommerce &> /dev/null; then
    LAST_SCHEDULE=$(kubectl get cronjob database-backup -n ecommerce -o jsonpath="{.status.lastScheduleTime}")
    check_passed "Database backup CronJob exists (last run: $LAST_SCHEDULE)"
else
    check_failed "Database backup CronJob not found"
fi

# 11. Verify SSL/TLS certificates
echo ""
echo "11. Verifying SSL/TLS certificates..."
if kubectl get certificate ecommerce-tls -n ecommerce &> /dev/null; then
    CERT_READY=$(kubectl get certificate ecommerce-tls -n ecommerce -o jsonpath="{.status.conditions[?(@.type=='Ready')].status}")
    if [[ "$CERT_READY" == "True" ]]; then
        check_passed "TLS certificate is ready"
        
        # Check expiry
        CERT_EXPIRY=$(kubectl get secret ecommerce-tls -n ecommerce -o jsonpath="{.data.tls\.crt}" | base64 -d | openssl x509 -noout -enddate | cut -d= -f2)
        check_passed "  Certificate expires: $CERT_EXPIRY"
    else
        check_failed "TLS certificate not ready"
    fi
else
    check_failed "TLS certificate not found"
fi

# 12. Verify ingress configuration
echo ""
echo "12. Verifying ingress configuration..."
if kubectl get ingress laravel-ingress -n ecommerce &> /dev/null; then
    INGRESS_HOST=$(kubectl get ingress laravel-ingress -n ecommerce -o jsonpath="{.spec.rules[0].host}")
    check_passed "Ingress configured for host: $INGRESS_HOST"
    
    # Verify DNS resolution
    if host "$INGRESS_HOST" &> /dev/null; then
        INGRESS_IP=$(host "$INGRESS_HOST" | awk '/has address/ { print $4 }' | head -1)
        check_passed "  DNS resolves to: $INGRESS_IP"
    else
        check_warning "  DNS does not resolve for $INGRESS_HOST"
    fi
else
    check_failed "Ingress not found"
fi

# 13. Verify resource quotas are not exceeded
echo ""
echo "13. Verifying resource quotas..."
if kubectl get resourcequota -n ecommerce &> /dev/null; then
    QUOTA_OUTPUT=$(kubectl get resourcequota -n ecommerce -o json)
    check_passed "Resource quotas configured"
    
    # Parse and display quota usage
    USED_CPU=$(echo "$QUOTA_OUTPUT" | jq -r '.items[0].status.used."requests.cpu" // "0"')
    HARD_CPU=$(echo "$QUOTA_OUTPUT" | jq -r '.items[0].status.hard."requests.cpu" // "unlimited"')
    echo "  CPU: $USED_CPU / $HARD_CPU"
    
    USED_MEM=$(echo "$QUOTA_OUTPUT" | jq -r '.items[0].status.used."requests.memory" // "0"')
    HARD_MEM=$(echo "$QUOTA_OUTPUT" | jq -r '.items[0].status.hard."requests.memory" // "unlimited"')
    echo "  Memory: $USED_MEM / $HARD_MEM"
else
    check_warning "No resource quotas configured"
fi

# Final summary
echo ""
echo "=================================================="
if [ $VALIDATION_FAILED -eq 0 ]; then
    echo -e "${GREEN}✓ All validation checks passed!${NC}"
    echo "  Production deployment can proceed."
    echo "=================================================="
    exit 0
else
    echo -e "${RED}✗ Validation failed!${NC}"
    echo "  Fix the issues above before deploying to production."
    echo "=================================================="
    exit 1
fi

Make it executable and run before every deployment:

chmod +x scripts/validate-production-config.sh

# Run validation
./scripts/validate-production-config.sh

# Example output when issues found:
# ==================================================
# Production Configuration Validation
# ==================================================
#
# 1. Verifying Kubernetes cluster context...
# ✓ Connected to production cluster: production-cluster
#
# 2. Verifying namespace...
# ✓ Namespace 'ecommerce' exists
#
# 3. Verifying Kubernetes secrets...
# ✓ Secret 'laravel-secrets' exists
# ✓   Key 'app-key' present in secret 'laravel-secrets'
# ✗   Key 'db-password' missing from secret 'laravel-secrets'
# ✗ Secret 'stripe-api-keys' not found
#
# ==================================================
# ✗ Validation failed!
#   Fix the issues above before deploying to production.
# ==================================================
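Because the checks accumulate into VALIDATION_FAILED instead of aborting on the first failure, one run reports every problem at once, and the final exit code can gate a CI stage. The core pattern, stripped down to a self-contained sketch:

```shell
#!/usr/bin/env bash
# Accumulator pattern: record failures, keep going, fail once at the end.
VALIDATION_FAILED=0

check() {                 # check <description> <command...>
    local desc=$1; shift
    if "$@"; then
        echo "PASS: $desc"
    else
        echo "FAIL: $desc"
        VALIDATION_FAILED=1
    fi
}

check "a check that passes" true
check "a check that fails"  false

# In the real script this becomes: exit $VALIDATION_FAILED
echo "final exit code would be: $VALIDATION_FAILED"
```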

Blue-Green Deployment Strategy

Blue-green deployment enables zero-downtime releases by maintaining two identical production environments. Traffic switches instantly between them.

Why this matters: Rolling updates still expose users to a bad release mid-deployment; a broken version can be serving a significant share of traffic before you notice. Blue-green deployment allows instant rollback and lets you validate a release before switching any traffic to it.

Complete Blue-Green Implementation

# kubernetes/deployments/blue-green-deploy.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ecommerce
---
# Blue deployment (currently serving production traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: laravel-blue
  namespace: ecommerce
  labels:
    app: laravel
    environment: blue
    version: v1.2.3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: laravel
      environment: blue
  template:
    metadata:
      labels:
        app: laravel
        environment: blue
        version: v1.2.3
    spec:
      containers:
      - name: laravel
        image: ghcr.io/ibekzod/laravel-ecommerce:v1.2.3
        ports:
        - containerPort: 8080
        envFrom:
        - configMapRef:
            name: laravel-config
        - secretRef:
            name: laravel-secrets
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
# Green deployment (new version, not yet serving traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: laravel-green
  namespace: ecommerce
  labels:
    app: laravel
    environment: green
    version: v1.3.0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: laravel
      environment: green
  template:
    metadata:
      labels:
        app: laravel
        environment: green
        version: v1.3.0
    spec:
      containers:
      - name: laravel
        image: ghcr.io/ibekzod/laravel-ecommerce:v1.3.0  # New version
        ports:
        - containerPort: 8080
        envFrom:
        - configMapRef:
            name: laravel-config
        - secretRef:
            name: laravel-secrets
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
# Service - routes to whichever environment is "active"
apiVersion: v1
kind: Service
metadata:
  name: laravel-service
  namespace: ecommerce
spec:
  selector:
    app: laravel
    environment: blue  # Change this to "green" to switch traffic
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: ClusterIP
---
# Blue service - for direct access during testing
apiVersion: v1
kind: Service
metadata:
  name: laravel-blue-service
  namespace: ecommerce
spec:
  selector:
    app: laravel
    environment: blue
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: ClusterIP
---
# Green service - for direct access during testing
apiVersion: v1
kind: Service
metadata:
  name: laravel-green-service
  namespace: ecommerce
spec:
  selector:
    app: laravel
    environment: green
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: ClusterIP
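With both Deployments and the per-environment Services in place, the cutover itself is a single selector patch on laravel-service: no pods restart, and traffic moves instantly. The inactive environment can be derived rather than hard-coded; a sketch under the assumption of the manifests above (the kubectl commands are left commented so the logic runs standalone):

```shell
#!/usr/bin/env bash
# Given the active environment, return the one to deploy to / switch to.
other_env() {
    if [[ "$1" == "blue" ]]; then echo "green"; else echo "blue"; fi
}

CURRENT=blue   # in practice, read it from the live Service:
               # kubectl get service laravel-service -n ecommerce \
               #   -o jsonpath='{.spec.selector.environment}'
TARGET=$(other_env "$CURRENT")
echo "Switching traffic from $CURRENT to $TARGET"

# The actual cutover: an instant selector change on the Service.
# kubectl patch service laravel-service -n ecommerce \
#   -p "{\"spec\":{\"selector\":{\"app\":\"laravel\",\"environment\":\"$TARGET\"}}}"
```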

Automated blue-green deployment script:

#!/bin/bash
# scripts/blue-green-deploy.sh
#
# Automated blue-green deployment with smoke tests and rollback capability
#
# Usage: ./scripts/blue-green-deploy.sh <new-version> <environment-to-deploy>
# Example: ./scripts/blue-green-deploy.sh v1.3.0 green

set -e

NEW_VERSION=$1
TARGET_ENV=$2  # "blue" or "green"

if [[ -z "$NEW_VERSION" ]] || [[ -z "$TARGET_ENV" ]]; then
    echo "Usage: $0 <version> <environment>"
    echo "Example: $0 v1.3.0 green"
    exit 1
fi

# Determine which environment is currently active
CURRENT_ENV=$(kubectl get service laravel-service -n ecommerce -o jsonpath='{.spec.selector.environment}')
echo "Current active environment: $CURRENT_ENV"
echo "Deploying version $NEW_VERSION to $TARGET_ENV environment"

if [[ "$CURRENT_ENV" == "$TARGET_ENV" ]]; then
    echo "ERROR: Cannot deploy to currently active environment"
    echo "Deploy to the inactive environment first, then switch traffic"
    exit 1
fi

# Step 1: Deploy new version to target environment
echo ""
echo "Step 1: Deploying $NEW_VERSION to $TARGET_ENV..."
kubectl set image deployment/laravel-$TARGET_ENV \
    laravel=ghcr.io/ibekzod/laravel-ecommerce:$NEW_VERSION \
    -n ecommerce

# Wait for rollout to complete
echo "Waiting for deployment to complete..."
kubectl rollout status deployment/laravel-$TARGET_ENV -n ecommerce --timeout=5m

# Step 2: Run database migrations on new version (if any)
echo ""
echo "Step 2: Running database migrations..."
MIGRATION_POD=$(kubectl get pod -n ecommerce -l app=laravel,environment=$TARGET_ENV -o jsonpath='{.items[0].metadata.name}')

kubectl exec -n ecommerce $MIGRATION_POD -- php artisan migrate --force

# Step 3: Wait for all pods to be ready
echo ""
echo "Step 3: Waiting for all pods to be ready..."
kubectl wait --for=condition=ready pod \
    -l app=laravel,environment=$TARGET_ENV \
    -n ecommerce \
    --timeout=5m

# Step 4: Run smoke tests against new environment
echo ""
echo "Step 4: Running smoke tests against $TARGET_ENV environment..."

# Get the service endpoint for the target environment
TARGET_SERVICE="laravel-${TARGET_ENV}-service"
TARGET_URL="http://${TARGET_SERVICE}.ecommerce.svc.cluster.local"

# Create a test pod to run smoke tests from inside the cluster
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: smoke-test
  namespace: ecommerce
spec:
  containers:
  - name: curl
    image: curlimages/curl:latest
    command: ['sleep', '300']
  restartPolicy: Never
EOF

# Wait for the test pod to be ready (a fixed sleep is race-prone)
kubectl wait --for=condition=ready pod/smoke-test -n ecommerce --timeout=60s

# Test 1: Health check
echo "  Testing health endpoint..."
kubectl exec -n ecommerce smoke-test -- curl -f -s "$TARGET_URL/health/ready" > /dev/null
echo "  ✓ Health check passed"

# Test 2: Homepage loads
echo "  Testing homepage..."
kubectl exec -n ecommerce smoke-test -- curl -f -s "$TARGET_URL/" > /dev/null
echo "  ✓ Homepage loads"

# Test 3: API endpoint
echo "  Testing API endpoint..."
kubectl exec -n ecommerce smoke-test -- curl -f -s "$TARGET_URL/api/products?limit=1" > /dev/null
echo "  ✓ API responds"

# Test 4: Database connectivity
echo "  Testing database connectivity..."
kubectl exec -n ecommerce $MIGRATION_POD -- php artisan tinker --execute="DB::connection()->getPdo();"
echo "  ✓ Database connected"

# Clean up test pod
kubectl delete pod smoke-test -n ecommerce

echo ""
echo "✓ All smoke tests passed"

# Step 5: Prompt to switch traffic
echo ""
echo "=================================================="
echo "Deployment to $TARGET_ENV environment complete!"
echo "New version $NEW_VERSION is ready but not receiving traffic."
echo ""
echo "To switch traffic to $TARGET_ENV environment, run:"
echo "  kubectl patch service laravel-service -n ecommerce -p '{\"spec\":{\"selector\":{\"environment\":\"$TARGET_ENV\"}}}'"
echo ""
echo "To rollback if issues occur, run:"
echo "  kubectl patch service laravel-service -n ecommerce -p '{\"spec\":{\"selector\":{\"environment\":\"$CURRENT_ENV\"}}}'"
echo "=================================================="
echo ""

# Optional: Automated traffic switch (uncomment if you want automatic switch)
# read -p "Switch traffic to $TARGET_ENV now? (y/N): " -n 1 -r
# echo
# if [[ $REPLY =~ ^[Yy]$ ]]; then
#     echo "Switching traffic to $TARGET_ENV..."
#     kubectl patch service laravel-service -n ecommerce -p "{\"spec\":{\"selector\":{\"environment\":\"$TARGET_ENV\"}}}"
#     echo "✓ Traffic switched to $TARGET_ENV"
#     echo "Monitor metrics at: https://grafana.yourdomain.com"
# fi

Run the deployment:

# Make script executable
chmod +x scripts/blue-green-deploy.sh

# Deploy new version to green environment
./scripts/blue-green-deploy.sh v1.3.0 green

# Output:
# Current active environment: blue
# Deploying version v1.3.0 to green environment
#
# Step 1: Deploying v1.3.0 to green...
# deployment.apps/laravel-green image updated
# Waiting for deployment to complete...
# deployment "laravel-green" successfully rolled out
#
# Step 2: Running database migrations...
# Nothing to migrate.
#
# Step 3: Waiting for all pods to be ready...
# pod/laravel-green-7d4f8c9b5-2xkwp condition met
# pod/laravel-green-7d4f8c9b5-8hjnm condition met
# pod/laravel-green-7d4f8c9b5-qz9rt condition met
#
# Step 4: Running smoke tests against green environment...
#   Testing health endpoint...
#   ✓ Health check passed
#   Testing homepage...
#   ✓ Homepage loads
#   Testing API endpoint...
#   ✓ API responds
#   Testing database connectivity...
#   ✓ Database connected
#
# ✓ All smoke tests passed
#
# ==================================================
# Deployment to green environment complete!
# New version v1.3.0 is ready but not receiving traffic.
#
# To switch traffic to green environment, run:
#   kubectl patch service laravel-service -n ecommerce -p '{"spec":{"selector":{"environment":"green"}}}'
#
# To rollback if issues occur, run:
#   kubectl patch service laravel-service -n ecommerce -p '{"spec":{"selector":{"environment":"blue"}}}'
# ==================================================

# After verifying everything looks good, switch traffic:
kubectl patch service laravel-service -n ecommerce \
    -p '{"spec":{"selector":{"environment":"green"}}}'

# If issues occur, instant rollback:
kubectl patch service laravel-service -n ecommerce \
    -p '{"spec":{"selector":{"environment":"blue"}}}'

The beauty of blue-green: Traffic switches in < 1 second. No gradual rollout. Either the new version works completely or you're back on the old version instantly.
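Teams often wrap that switch in a tiny script; the only logic involved is computing the opposite color. A minimal sketch (a hypothetical helper, pure bash except for the commented kubectl calls):

```shell
# other_env: given the active environment, return the one to switch to
other_env() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Against a live cluster you would combine it with the commands shown above:
# CURRENT=$(kubectl get service laravel-service -n ecommerce -o jsonpath='{.spec.selector.environment}')
# TARGET=$(other_env "$CURRENT")
# kubectl patch service laravel-service -n ecommerce \
#     -p "{\"spec\":{\"selector\":{\"environment\":\"$TARGET\"}}}"
```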


Production Observability Stack

Observability is not monitoring. Monitoring tells you what is broken. Observability tells you why it's broken. We need both.

Complete Observability Stack Deployment

# Install Prometheus + Grafana + Loki
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Create monitoring namespace
kubectl create namespace monitoring

# Install Prometheus (metrics collection)
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --set prometheus.prometheusSpec.retention=30d \
    --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
    --set grafana.enabled=true \
    --set grafana.adminPassword=ChangeThisPassword123! \
    --values - <<EOF
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    additionalScrapeConfigs:
    - job_name: 'laravel-metrics'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - ecommerce
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: \$1:\$2
        target_label: __address__

grafana:
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      laravel-application:
        url: https://grafana.com/api/dashboards/14504/revisions/1/download
      kubernetes-cluster:
        url: https://grafana.com/api/dashboards/7249/revisions/1/download
EOF

# Install Loki (log aggregation)
helm install loki grafana/loki-stack \
    --namespace monitoring \
    --set loki.persistence.enabled=true \
    --set loki.persistence.size=50Gi \
    --set promtail.enabled=true

# Install Jaeger (distributed tracing)
# The jaegertracing chart repo was not added above, so add it first
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
helm install jaeger jaegertracing/jaeger \
    --namespace monitoring \
    --set provisionDataStore.cassandra=false \
    --set allInOne.enabled=true \
    --set storage.type=memory \
    --set agent.enabled=false \
    --set collector.enabled=false \
    --set query.enabled=false

echo "Observability stack installed!"
echo ""
echo "Access Grafana:"
echo "  kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80"
echo "  Then visit: http://localhost:3000"
echo "  Username: admin"
echo "  Password: ChangeThisPassword123!"
echo ""
echo "Access Prometheus:"
echo "  kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090"
echo "  Then visit: http://localhost:9090"
echo ""
echo "Access Jaeger:"
echo "  kubectl port-forward -n monitoring svc/jaeger-query 16686:16686"
echo "  Then visit: http://localhost:16686"

Application Instrumentation for Observability

Add Prometheus metrics to Laravel:

<?php
// app/Http/Middleware/PrometheusMetrics.php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Cache;
use Symfony\Component\HttpFoundation\Response;

class PrometheusMetrics
{
    /**
     * Collect application metrics for Prometheus scraping
     */
    public function handle(Request $request, Closure $next): Response
    {
        $start_time = microtime(true);
        $start_memory = memory_get_usage(true);
        
        // Query logging is off by default; enable it so the query count below is non-empty
        \DB::enableQueryLog();
        
        // Process request
        $response = $next($request);
        
        // Calculate metrics
        $duration = microtime(true) - $start_time;
        $memory_used = memory_get_usage(true) - $start_memory;
        
        // Increment request counter
        $this->incrementMetric('http_requests_total', [
            'method' => $request->method(),
            'path' => $request->route()?->uri() ?? 'unknown',
            'status' => $response->getStatusCode(),
        ]);
        
        // Record request duration histogram
        $this->recordHistogram('http_request_duration_seconds', $duration, [
            'method' => $request->method(),
            'path' => $request->route()?->uri() ?? 'unknown',
        ]);
        
        // Record memory usage
        $this->recordGauge('http_request_memory_bytes', $memory_used, [
            'method' => $request->method(),
        ]);
        
        // Track database queries
        $query_count = count(\DB::getQueryLog());
        if ($query_count > 0) {
            $this->recordHistogram('database_queries_per_request', $query_count, [
                'path' => $request->route()?->uri() ?? 'unknown',
            ]);
        }
        
        return $response;
    }
    
    /**
     * Increment a counter metric
     */
    private function incrementMetric(string $name, array $labels = []): void
    {
        $key = $this->buildMetricKey($name, $labels);
        Cache::increment($key);
        
        // Store metric metadata for exposition
        $this->storeMetricMetadata($name, 'counter', $labels);
    }
    
    /**
     * Record a histogram value
     * Simplified implementation - production would use proper histogram buckets
     */
    private function recordHistogram(string $name, float $value, array $labels = []): void
    {
        // Store in a Redis sorted set for percentile calculations.
        // Use the Laravel Redis facade explicitly; a bare \Redis would
        // resolve to the phpredis extension class, not the facade.
        $key = $this->buildMetricKey($name, $labels);
        
        // Store value with timestamp as score
        \Illuminate\Support\Facades\Redis::zadd("histogram:$key", time(), $value);
        
        // Keep only the most recent 1000 values
        \Illuminate\Support\Facades\Redis::zremrangebyrank("histogram:$key", 0, -1001);
        
        $this->storeMetricMetadata($name, 'histogram', $labels);
    }
    
    /**
     * Set a gauge value
     */
    private function recordGauge(string $name, float $value, array $labels = []): void
    {
        $key = $this->buildMetricKey($name, $labels);
        Cache::put($key, $value, now()->addMinutes(5));
        
        $this->storeMetricMetadata($name, 'gauge', $labels);
    }
    
    /**
     * Build cache key from metric name and labels
     */
    private function buildMetricKey(string $name, array $labels): string
    {
        $label_string = '';
        foreach ($labels as $key => $value) {
            $label_string .= "{$key}=\"{$value}\",";
        }
        return "metrics:{$name}{" . rtrim($label_string, ',') . "}";
    }
    
    /**
     * Store metric metadata for exposition format
     */
    private function storeMetricMetadata(string $name, string $type, array $labels): void
    {
        $metadata = [
            'type' => $type,
            'labels' => array_keys($labels),
        ];
        Cache::put("metrics:metadata:$name", $metadata, now()->addHours(24));
    }
}
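One gap worth flagging: Prometheus scrapes an HTTP /metrics endpoint, and the middleware above only records values — you would still add a route that renders them in the Prometheus text exposition format (the route itself is not shown here). The shape that output must take, with illustrative numbers:

```shell
# Prometheus text exposition format: one HELP/TYPE header per metric,
# then one sample line per label combination (values here are made up).
exposition_sample() {
  cat <<'EOF'
# HELP http_requests_total Total HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="GET",path="api/products",status="200"} 1542
http_requests_total{method="POST",path="api/orders",status="201"} 87
EOF
}
exposition_sample
```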

Register the middleware. Laravel 11+ (including Laravel 12) no longer ships app/Http/Kernel.php; global middleware is registered in bootstrap/app.php:

<?php
// bootstrap/app.php

use Illuminate\Foundation\Application;
use Illuminate\Foundation\Configuration\Middleware;

return Application::configure(basePath: dirname(__DIR__))
    // ... ->withRouting(...) and ->withExceptions(...) as generated by the installer
    ->withMiddleware(function (Middleware $middleware) {
        $middleware->append(\App\Http\Middleware\PrometheusMetrics::class);
    })
    ->create();

Create custom Grafana dashboard configuration:

{
  "dashboard": {
    "title": "Laravel E-Commerce Platform - Production Metrics",
    "tags": ["laravel", "ecommerce", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{namespace=\"ecommerce\"}[5m])",
            "legendFormat": "{{method}} {{path}}"
          }
        ],
        "yaxes": [
          {
            "label": "Requests/sec"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{namespace=\"ecommerce\"}[5m]))",
            "legendFormat": "{{method}} {{path}}"
          }
        ],
        "yaxes": [
          {
            "label": "Seconds",
            "format": "s"
          }
        ]
      },
      {
        "title": "Error Rate (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "100 * (rate(http_requests_total{namespace=\"ecommerce\",status=~\"5..\"}[5m]) / rate(http_requests_total{namespace=\"ecommerce\"}[5m]))",
            "legendFormat": "5xx Errors"
          }
        ],
        "yaxes": [
          {
            "label": "Percentage",
            "format": "percent"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [1],
                "type": "gt"
              },
              "operator": {
                "type": "and"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "params": [],
                "type": "avg"
              },
              "type": "query"
            }
          ],
          "executionErrorState": "alerting",
          "for": "5m",
          "frequency": "1m",
          "handler": 1,
          "name": "High Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "title": "Database Query Count per Request",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(database_queries_per_request_sum{namespace=\"ecommerce\"}[5m]) / rate(database_queries_per_request_count{namespace=\"ecommerce\"}[5m])",
            "legendFormat": "{{path}}"
          }
        ],
        "yaxes": [
          {
            "label": "Queries"
          }
        ]
      },
      {
        "title": "Active Pods",
        "type": "stat",
        "targets": [
          {
            "expr": "count(kube_pod_status_phase{namespace=\"ecommerce\",phase=\"Running\"})",
            "legendFormat": "Running Pods"
          }
        ]
      },
      {
        "title": "Pod CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"ecommerce\",pod=~\"laravel.*\"}[5m])) by (pod)",
            "legendFormat": "{{pod}}"
          }
        ],
        "yaxes": [
          {
            "label": "CPU Cores"
          }
        ]
      },
      {
        "title": "Pod Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes{namespace=\"ecommerce\",pod=~\"laravel.*\"}) by (pod)",
            "legendFormat": "{{pod}}"
          }
        ],
        "yaxes": [
          {
            "label": "Bytes",
            "format": "bytes"
          }
        ]
      },
      {
        "title": "Redis Operations Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(redis_commands_processed_total{namespace=\"ecommerce\"}[5m])",
            "legendFormat": "Commands/sec"
          }
        ]
      },
      {
        "title": "Queue Job Processing Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(queue_jobs_processed_total{namespace=\"ecommerce\"}[5m])",
            "legendFormat": "{{queue}}"
          }
        ]
      },
      {
        "title": "Failed Jobs (last hour)",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(increase(queue_jobs_failed_total{namespace=\"ecommerce\"}[1h]))"
          }
        ],
        "thresholds": [
          {
            "value": 0,
            "color": "green"
          },
          {
            "value": 10,
            "color": "yellow"
          },
          {
            "value": 100,
            "color": "red"
          }
        ]
      }
    ]
  }
}

Save this as monitoring/grafana-dashboard.json and import into Grafana.
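Importing through the UI works, but the same file can also be pushed with Grafana's HTTP API: POST /api/dashboards/db takes a payload wrapped in a top-level "dashboard" key, which the file above already has. A sketch, assuming the port-forward and admin password from the install step:

```shell
# has_dashboard_key: cheap sanity check that the JSON file is wrapped
# in the {"dashboard": ...} envelope the API expects
has_dashboard_key() {
  grep -q '"dashboard"' "$1" && echo wrapped || echo bare
}

# Push to Grafana (adjust URL and credentials for your installation):
# curl -s -X POST http://localhost:3000/api/dashboards/db \
#   -H 'Content-Type: application/json' \
#   -u admin:ChangeThisPassword123! \
#   -d @monitoring/grafana-dashboard.json
```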


Incident Response Runbook

When production breaks at 2 AM, you need a runbook. Not documentation—a step-by-step recovery procedure.

Incident Classification & Response Times

Severity       Impact                   Response Time   Example
P0 - Critical  Complete service outage  < 15 minutes    Database down, all requests failing
P1 - High      Major feature broken     < 1 hour        Payment processing failing
P2 - Medium    Degraded performance     < 4 hours       Slow response times, increased errors
P3 - Low       Minor issue              < 24 hours      Non-critical feature bug
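These tiers are easy to encode in paging tooling, for example to compute response deadlines. A hypothetical helper (the minutes mirror the classification above):

```shell
# response_sla_minutes: response-time target for a given severity tier
response_sla_minutes() {
  case "$1" in
    P0) echo 15   ;;  # critical: complete outage
    P1) echo 60   ;;  # high: major feature broken
    P2) echo 240  ;;  # medium: degraded performance
    P3) echo 1440 ;;  # low: minor issue
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```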

P0: Critical Incident Response Procedure

#!/bin/bash
# runbooks/p0-incident-response.sh
#
# Execute this immediately when P0 incident is detected
# This script gathers diagnostic information and prepares rollback

set -e

echo "=========================================="
echo "P0 CRITICAL INCIDENT RESPONSE"
echo "Started at: $(date)"
echo "=========================================="
echo ""

# Create incident directory
INCIDENT_ID="incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "incidents/$INCIDENT_ID"
cd "incidents/$INCIDENT_ID"

echo "Incident ID: $INCIDENT_ID"
echo "Collecting diagnostic information..."
echo ""

# 1. Capture current cluster state
echo "1. Capturing cluster state..."
kubectl get all -n ecommerce > cluster-state.txt
kubectl get events -n ecommerce --sort-by='.lastTimestamp' > events.txt
kubectl top nodes > node-resources.txt
kubectl top pods -n ecommerce > pod-resources.txt

# 2. Check pod status
echo "2. Checking pod health..."
kubectl get pods -n ecommerce -o wide > pods-detailed.txt

UNHEALTHY_PODS=$(kubectl get pods -n ecommerce --field-selector=status.phase!=Running -o name)
if [[ -n "$UNHEALTHY_PODS" ]]; then
    echo "ALERT: Unhealthy pods detected:"
    echo "$UNHEALTHY_PODS"
    
    # Get logs from unhealthy pods
    for pod in $UNHEALTHY_PODS; do
        POD_NAME=$(echo $pod | cut -d'/' -f2)
        echo "  Collecting logs from $POD_NAME..."
        kubectl logs $pod -n ecommerce --tail=500 > "logs-${POD_NAME}.txt" 2>&1 || true
        kubectl describe $pod -n ecommerce > "describe-${POD_NAME}.txt" 2>&1 || true
    done
fi

# 3. Check service endpoints
echo "3. Checking service endpoints..."
kubectl get endpoints -n ecommerce > endpoints.txt

SERVICE_ENDPOINTS=$(kubectl get endpoints laravel-service -n ecommerce -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w)
if [[ $SERVICE_ENDPOINTS -eq 0 ]]; then
    echo "CRITICAL: No healthy endpoints for laravel-service!"
    echo "Service has zero ready pods - complete outage"
fi

# 4. Check recent deployments
echo "4. Checking recent deployments..."
kubectl rollout history deployment -n ecommerce > deployment-history.txt

# 5. Query Prometheus for error rates
echo "5. Querying error metrics..."
# In-cluster DNS name; when running this script from a workstation, port-forward first
# (kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090)
# and point PROM_URL at http://localhost:9090 instead
PROM_URL="http://prometheus-kube-prometheus-prometheus.monitoring:9090"
ERROR_RATE=$(curl -s "$PROM_URL/api/v1/query?query=rate(http_requests_total{namespace=\"ecommerce\",status=~\"5..\"}[5m])" | jq -r '.data.result[0].value[1]' 2>/dev/null || echo "N/A")
echo "Current 5xx error rate: $ERROR_RATE errors/sec"

# 6. Check database connectivity
echo "6. Checking database connectivity..."
DB_POD=$(kubectl get pod -n ecommerce -l app=mysql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -n "$DB_POD" ]]; then
    kubectl exec -n ecommerce $DB_POD -- sh -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SELECT 1"' > db-connectivity.txt 2>&1 && \
        echo "  Database is responding" || \
        echo "  ERROR: Database is not responding"
    
    # Check database connections
    kubectl exec -n ecommerce $DB_POD -- sh -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW PROCESSLIST"' > db-connections.txt 2>&1
    
    # Check database locks
    kubectl exec -n ecommerce $DB_POD -- sh -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW ENGINE INNODB STATUS\G"' > db-locks.txt 2>&1
else
    echo "  ERROR: Cannot find database pod"
fi

# 7. Check Redis connectivity
echo "7. Checking Redis connectivity..."
REDIS_POD=$(kubectl get pod -n ecommerce -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -n "$REDIS_POD" ]]; then
    kubectl exec -n ecommerce $REDIS_POD -- redis-cli PING > redis-connectivity.txt 2>&1 && \
        echo "  Redis is responding" || \
        echo "  ERROR: Redis is not responding"
    
    # Check Redis memory usage
    kubectl exec -n ecommerce $REDIS_POD -- redis-cli INFO memory > redis-memory.txt 2>&1
    
    # Check queue sizes
    kubectl exec -n ecommerce $REDIS_POD -- redis-cli LLEN queues:default > queue-size.txt 2>&1
else
    echo "  ERROR: Cannot find Redis pod"
fi

# 8. Generate incident summary
echo ""
echo "=========================================="
echo "DIAGNOSTIC SUMMARY"
echo "=========================================="

cat > incident-summary.txt <<EOF
Incident ID: $INCIDENT_ID
Timestamp: $(date)
Severity: P0 - Critical

CLUSTER STATE:
- Total pods: $(kubectl get pods -n ecommerce --no-headers | wc -l)
- Running pods: $(kubectl get pods -n ecommerce --field-selector=status.phase=Running --no-headers | wc -l)
- Failed pods: $(kubectl get pods -n ecommerce --field-selector=status.phase=Failed --no-headers | wc -l)
- Service endpoints: $SERVICE_ENDPOINTS

RECENT EVENTS:
$(kubectl get events -n ecommerce --sort-by='.lastTimestamp' | tail -10)

ERROR RATE:
- 5xx errors/sec: $ERROR_RATE

RECOMMENDED ACTIONS:
1. Review pod logs in logs-*.txt files
2. Check deployment-history.txt for recent changes
3. Verify database connectivity (db-connectivity.txt)
4. If recent deployment, consider rollback:
   kubectl rollout undo deployment/laravel-blue -n ecommerce
   kubectl rollout undo deployment/laravel-green -n ecommerce
5. If database issue, check db-locks.txt for deadlocks
6. If memory issue, check pod-resources.txt

ROLLBACK COMMANDS:
# Rollback to previous deployment
kubectl rollout undo deployment/laravel-blue -n ecommerce
kubectl rollout undo deployment/laravel-green -n ecommerce

# Or switch blue-green environment (if using blue-green)
kubectl patch service laravel-service -n ecommerce -p '{"spec":{"selector":{"environment":"blue"}}}'

# Scale up if pods are down
kubectl scale deployment/laravel-blue -n ecommerce --replicas=5

# Restart all pods (last resort)
kubectl rollout restart deployment/laravel-blue -n ecommerce
kubectl rollout restart deployment/laravel-green -n ecommerce
EOF

cat incident-summary.txt

echo ""
echo "=========================================="
echo "Diagnostic information saved to: incidents/$INCIDENT_ID"
echo ""
echo "Next steps:"
echo "1. Review incident-summary.txt"
echo "2. Execute rollback if needed"
echo "3. Notify team via Slack/PagerDuty"
echo "4. Create post-incident report"
echo "=========================================="

Make it executable:

chmod +x runbooks/p0-incident-response.sh

# Run during incident
./runbooks/p0-incident-response.sh

# Output saved to incidents/incident-20240315-143022/
# Review files and execute recommended actions

Automated Alerting Configuration

Configure PagerDuty integration with Prometheus AlertManager:

# monitoring/alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'team-pagerduty'
      routes:
      # P0 - Immediate page
      - match:
          severity: critical
        receiver: 'team-pagerduty'
        continue: true
      
      # P1 - Page during business hours, urgent slack otherwise
      - match:
          severity: warning
        receiver: 'team-slack'
      
      # P2/P3 - Slack only
      - match:
          severity: info
        receiver: 'team-slack'
    
    receivers:
    - name: 'team-pagerduty'
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        severity: '{{ .CommonLabels.severity }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          num_resolved: '{{ .Alerts.Resolved | len }}'
    
    - name: 'team-slack'
      slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#production-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

Define Prometheus alert rules:

# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: laravel-ecommerce-alerts
  namespace: monitoring
spec:
  groups:
  - name: application
    interval: 30s
    rules:
    
    # P0 - Critical: Complete service outage
    - alert: ServiceDown
      expr: up{job="laravel-metrics"} == 0
      for: 1m
      labels:
        severity: critical
        priority: P0
      annotations:
        summary: "Laravel service is down"
        description: "Laravel application in namespace {{ $labels.namespace }} has been down for more than 1 minute"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-ServiceDown"
    
    # P0 - Critical: High error rate
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{namespace="ecommerce",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{namespace="ecommerce"}[5m]))
        ) > 0.05
      for: 2m
      labels:
        severity: critical
        priority: P0
      annotations:
        summary: "Error rate above 5%"
        description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-HighErrorRate"
    
    # P0 - Critical: Database down
    - alert: DatabaseDown
      expr: up{job="mysql-exporter"} == 0
      for: 1m
      labels:
        severity: critical
        priority: P0
      annotations:
        summary: "Database is unreachable"
        description: "Cannot connect to MySQL database in namespace {{ $labels.namespace }}"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-DatabaseDown"
    
    # P1 - High: Slow response times
    - alert: HighResponseTime
      expr: |
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket{namespace="ecommerce"}[5m])
        ) > 2
      for: 5m
      labels:
        severity: warning
        priority: P1
      annotations:
        summary: "95th percentile response time above 2 seconds"
        description: "P95 response time is {{ $value | humanizeDuration }} for {{ $labels.path }}"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-SlowResponses"
    
    # P1 - High: Queue backup
    - alert: QueueBacklog
      expr: queue_jobs_pending > 1000
      for: 10m
      labels:
        severity: warning
        priority: P1
      annotations:
        summary: "Queue has {{ $value }} pending jobs"
        description: "Queue {{ $labels.queue }} has been backed up for 10 minutes"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-QueueBacklog"
    
    # P1 - High: Pod crashlooping
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total{namespace="ecommerce"}[15m]) > 0
      for: 5m
      labels:
        severity: warning
        priority: P1
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod has restarted {{ $value }} times in the last 15 minutes"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-CrashLoop"
    
    # P2 - Medium: High memory usage
    - alert: HighMemoryUsage
      expr: |
        (
          container_memory_working_set_bytes{namespace="ecommerce",pod=~"laravel.*"}
          /
          container_spec_memory_limit_bytes{namespace="ecommerce",pod=~"laravel.*"}
        ) > 0.85
      for: 10m
      labels:
        severity: warning
        priority: P2
      annotations:
        summary: "Pod {{ $labels.pod }} memory usage above 85%"
        description: "Memory usage is {{ $value | humanizePercentage }}"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-HighMemory"
    
    # P2 - Medium: High CPU usage
    - alert: HighCPUUsage
      expr: |
        (
          rate(container_cpu_usage_seconds_total{namespace="ecommerce",pod=~"laravel.*"}[5m])
          /
          (container_spec_cpu_quota{namespace="ecommerce",pod=~"laravel.*"} / container_spec_cpu_period{namespace="ecommerce",pod=~"laravel.*"})
        ) > 0.85
      for: 10m
      labels:
        severity: warning
        priority: P2
      annotations:
        summary: "Pod {{ $labels.pod }} CPU usage above 85%"
        description: "CPU usage is {{ $value | humanizePercentage }}"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-HighCPU"
    
    # P2 - Medium: Disk space running low
    - alert: DiskSpaceLow
      expr: |
        (
          kubelet_volume_stats_available_bytes{namespace="ecommerce"}
          /
          kubelet_volume_stats_capacity_bytes{namespace="ecommerce"}
        ) < 0.15
      for: 5m
      labels:
        severity: warning
        priority: P2
      annotations:
        summary: "Disk space below 15% on {{ $labels.persistentvolumeclaim }}"
        description: "Only {{ $value | humanizePercentage }} disk space remaining"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-DiskSpace"
    
    # P3 - Low: Certificate expiring soon
    - alert: CertificateExpiringSoon
      expr: (certmanager_certificate_expiration_timestamp_seconds - time()) < (7 * 24 * 3600)
      for: 1h
      labels:
        severity: info
        priority: P3
      annotations:
        summary: "Certificate {{ $labels.name }} expires in {{ $value | humanizeDuration }}"
        description: "TLS certificate will expire soon, renewal required"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-CertRenewal"

  - name: business-metrics
    interval: 1m
    rules:
    
    # Business metric: Payment processing failure rate
    - alert: PaymentFailureSpike
      expr: |
        (
          sum(rate(payment_transactions_total{status="failed"}[5m]))
          /
          sum(rate(payment_transactions_total[5m]))
        ) > 0.10
      for: 3m
      labels:
        severity: critical
        priority: P0
        team: payments
      annotations:
        summary: "Payment failure rate above 10%"
        description: "{{ $value | humanizePercentage }} of payments are failing"
        impact: "Direct revenue loss"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-PaymentFailures"
    
    # Business metric: Order processing stalled
    - alert: NoOrdersProcessed
      expr: sum(rate(orders_created_total[10m])) == 0
      for: 10m
      labels:
        severity: warning
        priority: P1
        team: orders
      annotations:
        summary: "No orders processed in last 10 minutes"
        description: "Order creation has stopped - possible system issue"
        runbook_url: "https://github.com/iBekzod/laravel-ecommerce/wiki/Runbook-OrderProcessing"

Apply the configurations:

# Apply alert rules
kubectl apply -f monitoring/prometheus-rules.yaml

# Apply alertmanager config
kubectl apply -f monitoring/alertmanager-config.yaml

# Restart alertmanager to pick up new config
kubectl rollout restart statefulset/alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring

Cost Optimization in Production

Production costs spiral without active management. Here's how to optimize without sacrificing reliability.

1. Right-Sizing Resources Based on Actual Usage

#!/bin/bash
# scripts/analyze-resource-usage.sh
#
# Analyzes actual resource usage vs requested/limits
# Identifies opportunities for cost reduction

echo "Resource Usage Analysis"
echo "======================="
echo ""

# Get all pods in ecommerce namespace
PODS=$(kubectl get pods -n ecommerce -o jsonpath='{.items[*].metadata.name}')

echo "Pod Resource Usage vs Requests/Limits:"
echo ""
printf "%-40s %10s %10s %10s %10s %15s\n" "POD" "CPU_USE" "CPU_REQ" "MEM_USE" "MEM_REQ" "OPTIMIZATION"

for pod in $PODS; do
    # Get actual usage (one kubectl call; column 2 is CPU, column 3 is memory)
    TOP_LINE=$(kubectl top pod $pod -n ecommerce --no-headers)
    CPU_USAGE=$(echo "$TOP_LINE" | awk '{print $2}')
    MEM_USAGE=$(echo "$TOP_LINE" | awk '{print $3}')
    
    # Get requests
    CPU_REQUEST=$(kubectl get pod $pod -n ecommerce -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
    MEM_REQUEST=$(kubectl get pod $pod -n ecommerce -o jsonpath='{.spec.containers[0].resources.requests.memory}')
    
    # Calculate utilization (assumes millicore-suffixed quantities like "250m";
    # a pod with no CPU request yields an empty string, treated as 0 below)
    CPU_USAGE_NUM=$(echo "$CPU_USAGE" | sed 's/m//')
    CPU_REQUEST_NUM=$(echo "$CPU_REQUEST" | sed 's/m//')
    
    if [[ ${CPU_REQUEST_NUM:-0} -gt 0 ]]; then
        CPU_UTIL=$((CPU_USAGE_NUM * 100 / CPU_REQUEST_NUM))
    else
        CPU_UTIL=0
    fi
    
    # Provide optimization recommendation
    if [[ $CPU_UTIL -lt 30 ]]; then
        OPTIMIZATION="REDUCE_REQUESTS"
    elif [[ $CPU_UTIL -gt 80 ]]; then
        OPTIMIZATION="INCREASE_REQUESTS"
    else
        OPTIMIZATION="OK"
    fi
    
    printf "%-40s %10s %10s %10s %10s %15s\n" \
        "$pod" "$CPU_USAGE" "$CPU_REQUEST" "$MEM_USAGE" "$MEM_REQUEST" "$OPTIMIZATION"
done

echo ""
echo "Cost Optimization Recommendations:"
echo "==================================="

# Calculate total cluster costs (example with AWS EKS pricing)
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
NODE_TYPE="t3.large"  # Adjust to your instance type
COST_PER_NODE_HOUR=0.0832  # t3.large on-demand price
HOURS_PER_MONTH=730

MONTHLY_COST=$(echo "$NODE_COUNT * $COST_PER_NODE_HOUR * $HOURS_PER_MONTH" | bc)

echo "Current cluster costs:"
echo "  Nodes: $NODE_COUNT x $NODE_TYPE"
echo "  Estimated monthly cost: \$$MONTHLY_COST"
echo ""

# Recommendations
echo "1. Consider using Spot Instances for non-critical workloads"
echo "   Potential savings: 60-90%"
echo ""
echo "2. Enable Cluster Autoscaler to scale nodes based on demand"
echo "   Average savings: 30-40%"
echo ""
echo "3. Use Horizontal Pod Autoscaler for application scaling"
echo "   Prevents over-provisioning"
echo ""
echo "4. Implement PodDisruptionBudgets for safe scaling"
echo "   Maintains availability during scale-down"
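One caveat in the script above: Kubernetes CPU quantities come in two forms, millicores (`250m`) or cores (`1`, `0.5`), and a plain `sed 's/m//'` only handles the first. A small POSIX-sh helper normalizes both (the `to_millicores` name is ours; this is a sketch, not part of the script):

```shell
#!/bin/sh
# Normalize a Kubernetes CPU quantity to millicores.
# "250m" -> 250, "1" -> 1000, "0.5" -> 500
to_millicores() {
    case "$1" in
        *m) echo "${1%m}" ;;                                     # already millicores
        *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 }' ;; # cores -> millicores
    esac
}

to_millicores 250m   # -> 250
to_millicores 1      # -> 1000
to_millicores 0.5    # -> 500
```

With both usage and request run through the helper, the utilization arithmetic holds regardless of how the request was declared.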

2. Implement Horizontal Pod Autoscaler

# kubernetes/autoscaling/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: laravel-hpa
  namespace: ecommerce
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: laravel-blue  # with blue-green, retarget (or duplicate) this HPA for whichever color is live
  minReplicas: 2
  maxReplicas: 10
  metrics:
  # Scale based on CPU usage
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  
  # Scale based on memory usage
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  
  # Scale based on custom metric (requests per second); requires a
  # custom-metrics adapter such as prometheus-adapter in the cluster
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 4
        periodSeconds: 30
      selectPolicy: Max
---
# Worker HPA - different scaling characteristics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: laravel-worker-hpa
  namespace: ecommerce
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: laravel-worker
  minReplicas: 1
  maxReplicas: 5
  metrics:
  # Scale based on queue length (External metrics likewise need an
  # adapter exposing redis_queue_length through the external metrics API)
  - type: External
    external:
      metric:
        name: redis_queue_length
        selector:
          matchLabels:
            queue: default
      target:
        type: AverageValue
        averageValue: "100"
  
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # Wait 10 minutes before scaling down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60

Apply and verify HPA:

kubectl apply -f kubernetes/autoscaling/hpa.yaml

# Verify HPA is working
kubectl get hpa -n ecommerce

# Output:
# NAME           REFERENCE               TARGETS              MINPODS   MAXPODS   REPLICAS
# laravel-hpa    Deployment/laravel-blue 45%/70%, 52%/80%    2         10        3
# laravel-worker Deployment/laravel-worker 73/100            1         5         2

# Watch HPA in action
kubectl get hpa -n ecommerce --watch
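Behind those TARGETS columns, the HPA controller applies one documented formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). A quick sketch with illustrative numbers:

```shell
#!/bin/sh
# HPA scaling decision: desired = ceil(current * currentMetric / targetMetric)
desired_replicas() {
    awk -v cur="$1" -v now="$2" -v target="$3" 'BEGIN {
        d = cur * now / target
        r = (d == int(d)) ? int(d) : int(d) + 1   # ceil
        print r
    }'
}

desired_replicas 3 90 70   # 3 pods at 90% CPU vs 70% target -> 4
desired_replicas 4 35 70   # 4 pods at 35% CPU vs 70% target -> 2
```

The behavior policies above then rate-limit how fast the controller may actually move toward that desired count.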

3. Cluster Autoscaler for Node Management

# kubernetes/autoscaling/cluster-autoscaler.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "namespaces"
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create","list","watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      priorityClassName: system-cluster-critical
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
      serviceAccountName: cluster-autoscaler
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 600Mi
            requests:
              cpu: 100m
              memory: 600Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production-cluster
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false
            - --scale-down-unneeded-time=10m
            - --scale-down-delay-after-add=10m
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"

Performance Monitoring & SLO Tracking

Service Level Objectives (SLOs) define acceptable performance. Track them rigorously.

Define SLOs for E-Commerce Platform

# monitoring/slo-definitions.yaml
# Service Level Objectives for Laravel E-Commerce Platform
#
# SLI (Service Level Indicator): What you measure
# SLO (Service Level Objective): Target value
# SLA (Service Level Agreement): What you promise customers

slos:
  # Availability SLO: 99.9% uptime (43 minutes downtime per month)
  - name: "availability"
    target: 0.999
    window: "30d"
    sli:
      query: |
        sum(rate(http_requests_total{namespace="ecommerce",status!~"5.."}[5m]))
        /
        sum(rate(http_requests_total{namespace="ecommerce"}[5m]))
    error_budget: 0.001  # 0.1% = 43 minutes per month
    
  # Latency SLO: 95% of requests under 500ms
  - name: "latency_p95"
    target: 0.5  # seconds
    percentile: 95
    window: "7d"
    sli:
      query: |
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket{namespace="ecommerce"}[5m])
        )
    
  # Payment Success Rate: 99.5% of payments succeed
  - name: "payment_success_rate"
    target: 0.995
    window: "30d"
    sli:
      query: |
        sum(rate(payment_transactions_total{status="success"}[5m]))
        /
        sum(rate(payment_transactions_total[5m]))
    error_budget: 0.005
  
  # Order Processing Time: 95% of orders processed within 30 seconds
  - name: "order_processing_time_p95"
    target: 30  # seconds
    percentile: 95
    window: "7d"
    sli:
      query: |
        histogram_quantile(0.95,
          rate(order_processing_duration_seconds_bucket[5m])
        )
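The availability numbers in the comments (99.9% leaving roughly 43 minutes of downtime a month) fall straight out of the window arithmetic: budget_minutes = window_minutes × (1 - target). A quick check:

```shell
#!/bin/sh
# Downtime budget per 30-day window for common availability targets
for target in 0.999 0.995 0.99; do
    awk -v t="$target" 'BEGIN {
        printf "%.3f -> %.1f minutes/month\n", t, 30 * 24 * 60 * (1 - t)
    }'
done
# 0.999 -> 43.2 minutes/month
# 0.995 -> 216.0 minutes/month
# 0.990 -> 432.0 minutes/month
```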

Implement SLO tracking in Laravel:

<?php
// app/Services/SLOTracker.php

namespace App\Services;

use Illuminate\Support\Facades\Redis;
use Illuminate\Support\Facades\Log;

class SLOTracker
{
    /**
     * Track availability SLI
     */
    public function trackRequest(string $status_code, float $duration): void
    {
        $timestamp = now()->timestamp;
        $success = !str_starts_with($status_code, '5');
        
        // Store in time-series for SLO calculation; the uniqid keeps members
        // distinct (sorted-set members are unique, so without it every request
        // sharing a second and status would collapse into one entry)
        Redis::zadd('slo:availability:requests', $timestamp, json_encode([
            'id' => uniqid('', true),
            'timestamp' => $timestamp,
            'success' => $success,
            'status' => $status_code,
        ]));
        
        // Store latency measurement (caveat: sorted-set members are unique,
        // so identical durations collapse; a Redis stream or histogram is
        // more faithful at high volume)
        Redis::zadd('slo:latency:measurements', $timestamp, $duration);
        
        // Keep only last 30 days of data
        $cutoff = now()->subDays(30)->timestamp;
        Redis::zremrangebyscore('slo:availability:requests', '-inf', $cutoff);
        Redis::zremrangebyscore('slo:latency:measurements', '-inf', $cutoff);
        
        // Check if we're burning error budget too fast
        $this->checkErrorBudgetBurn();
    }
    
    /**
     * Track payment transaction outcome
     */
    public function trackPayment(bool $success, float $amount): void
    {
        $timestamp = now()->timestamp;
        
        Redis::zadd('slo:payments:transactions', $timestamp, json_encode([
            'id' => uniqid('', true),  // keep members unique within a second
            'timestamp' => $timestamp,
            'success' => $success,
            'amount' => $amount,
        ]));
        
        // Alert if payment success rate drops below threshold
        if (!$success) {
            $recent_failure_rate = $this->calculateRecentPaymentFailureRate();
            if ($recent_failure_rate > 0.02) {  // 2% failure rate
                $this->alertHighPaymentFailureRate($recent_failure_rate);
            }
        }
    }
    
    /**
     * Calculate current SLO compliance
     */
    public function calculateSLOCompliance(string $slo_name, int $window_days = 30): array
    {
        $cutoff = now()->subDays($window_days)->timestamp;
        
        switch ($slo_name) {
            case 'availability':
                return $this->calculateAvailabilitySLO($cutoff);
            
            case 'latency':
                return $this->calculateLatencySLO($cutoff);
            
            case 'payment_success':
                return $this->calculatePaymentSuccessSLO($cutoff);
            
            default:
                throw new \InvalidArgumentException("Unknown SLO: $slo_name");
        }
    }
    
    /**
     * Calculate availability SLO
     */
    private function calculateAvailabilitySLO(int $cutoff): array
    {
        $requests = Redis::zrangebyscore('slo:availability:requests', $cutoff, '+inf');
        
        $total = count($requests);
        $successful = 0;
        
        foreach ($requests as $request_json) {
            $request = json_decode($request_json, true);
            if ($request['success']) {
                $successful++;
            }
        }
        
        $availability = $total > 0 ? $successful / $total : 1.0;
        $target = 0.999;
        // Budget consumed = observed error fraction / allowed error fraction
        $error_budget_remaining = 1 - ((1 - $availability) / (1 - $target));
        
        return [
            'slo_name' => 'availability',
            'target' => $target,
            'actual' => $availability,
            'compliant' => $availability >= $target,
            'error_budget_remaining' => max(0, $error_budget_remaining),
            'total_requests' => $total,
            'successful_requests' => $successful,
        ];
    }
    
    /**
     * Calculate latency SLO (P95)
     */
    private function calculateLatencySLO(int $cutoff): array
    {
        $measurements = Redis::zrangebyscore('slo:latency:measurements', $cutoff, '+inf');
        
        if (empty($measurements)) {
            return [
                'slo_name' => 'latency_p95',
                'target' => 0.5,
                'actual' => 0,
                'compliant' => true,
                'sample_count' => 0,
            ];
        }
        
        sort($measurements, SORT_NUMERIC);
        $p95_index = (int) (count($measurements) * 0.95);
        $p95_latency = (float) $measurements[$p95_index];  // Redis returns strings
        
        $target = 0.5;  // 500ms
        
        return [
            'slo_name' => 'latency_p95',
            'target' => $target,
            'actual' => $p95_latency,
            'compliant' => $p95_latency <= $target,
            'sample_count' => count($measurements),
        ];
    }
    
    /**
     * Calculate payment success SLO
     */
    private function calculatePaymentSuccessSLO(int $cutoff): array
    {
        $transactions = Redis::zrangebyscore('slo:payments:transactions', $cutoff, '+inf');
        
        $total = count($transactions);
        $successful = 0;
        
        foreach ($transactions as $transaction_json) {
            $transaction = json_decode($transaction_json, true);
            if ($transaction['success']) {
                $successful++;
            }
        }
        
        $success_rate = $total > 0 ? $successful / $total : 1.0;
        $target = 0.995;
        
        return [
            'slo_name' => 'payment_success_rate',
            'target' => $target,
            'actual' => $success_rate,
            'compliant' => $success_rate >= $target,
            'total_transactions' => $total,
        ];
    }
    
    /**
     * Check if we're burning error budget too quickly
     * This detects rapid degradation before we violate SLO
     */
    private function checkErrorBudgetBurn(): void
    {
        // Look at last hour
        $one_hour_ago = now()->subHour()->timestamp;
        $requests = Redis::zrangebyscore('slo:availability:requests', $one_hour_ago, '+inf');
        
        if (count($requests) < 100) {
            return;  // Not enough data
        }
        
        $failures = 0;
        foreach ($requests as $request_json) {
            $request = json_decode($request_json, true);
            if (!$request['success']) {
                $failures++;
            }
        }
        
        $error_rate = $failures / count($requests);
        
        // If error rate in last hour > 1%, we're burning budget 10x faster than sustainable
        if ($error_rate > 0.01) {
            Log::critical('Rapid error budget burn detected', [
                'error_rate' => $error_rate,
                'failures_last_hour' => $failures,
                'requests_last_hour' => count($requests),
                'burn_rate' => $error_rate / 0.001,  // Relative to monthly budget
            ]);
            
            // Trigger P1 alert
            $this->alertErrorBudgetBurn($error_rate);
        }
    }
    
    /**
     * Calculate recent payment failure rate (last 5 minutes)
     */
    private function calculateRecentPaymentFailureRate(): float
    {
        $five_min_ago = now()->subMinutes(5)->timestamp;
        $transactions = Redis::zrangebyscore('slo:payments:transactions', $five_min_ago, '+inf');
        
        if (empty($transactions)) {
            return 0.0;
        }
        
        $failures = 0;
        foreach ($transactions as $transaction_json) {
            $transaction = json_decode($transaction_json, true);
            if (!$transaction['success']) {
                $failures++;
            }
        }
        
        return $failures / count($transactions);
    }
    
    /**
     * Alert on rapid error budget consumption
     */
    private function alertErrorBudgetBurn(float $error_rate): void
    {
        // Integration with your alerting system
        // This would trigger PagerDuty, Slack, etc.
        event(new \App\Events\ErrorBudgetBurnAlert([
            'error_rate' => $error_rate,
            'severity' => 'high',
            'message' => sprintf('Error budget burning at %.2f%% error rate', $error_rate * 100),
        ]));
    }
    
    /**
     * Alert on high payment failure rate
     */
    private function alertHighPaymentFailureRate(float $failure_rate): void
    {
        event(new \App\Events\PaymentFailureRateAlert([
            'failure_rate' => $failure_rate,
            'severity' => 'critical',
            'message' => sprintf('Payment failure rate at %.2f%%', $failure_rate * 100),
        ]));
    }
}
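The "10x faster than sustainable" figure in `checkErrorBudgetBurn()` is the standard burn-rate calculation: observed error rate divided by the allowed error fraction (1 - SLO target). A sketch:

```shell
#!/bin/sh
# Burn rate = observed error rate / allowed error fraction.
# At a 99.9% SLO the allowed fraction is 0.001, so a 1% hourly
# error rate consumes the monthly budget 10x too fast.
burn_rate() {
    awk -v err="$1" -v slo="$2" 'BEGIN { printf "%.0f\n", err / (1 - slo) }'
}

burn_rate 0.01 0.999    # -> 10
burn_rate 0.001 0.999   # -> 1 (exactly sustainable)
```

A burn rate of 1 exhausts the budget exactly at the end of the window; anything sustained above 1 means the SLO will be violated before the window closes.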

Disaster Recovery & Business Continuity

Hope is not a strategy. Test your disaster recovery plan before you need it.

Automated Database Backup System

# kubernetes/cronjobs/database-backup.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
  namespace: ecommerce
spec:
  # Run every 6 hours
  schedule: "0 */6 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            # Debian-based variant so the apt-get awscli install below works
            # (the default mysql:8.0 tag is Oracle Linux-based)
            image: mysql:8.0-debian
            env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: laravel-secrets
                  key: db-host
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: laravel-secrets
                  key: db-user
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: laravel-secrets
                  key: db-password
            - name: DB_NAME
              value: "ecommerce"
            - name: S3_BUCKET
              value: "ecommerce-backups"
            - name: AWS_REGION
              value: "us-east-1"
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: secret-access-key
            command:
            - /bin/bash
            - -c
            - |
              set -e
              
              BACKUP_DATE=$(date +%Y%m%d-%H%M%S)
              BACKUP_FILE="backup-${BACKUP_DATE}.sql.gz"
              
              echo "Starting database backup at $(date)"
              
              # Create backup with mysqldump
              mysqldump \
                --host="$DB_HOST" \
                --user="$DB_USER" \
                --password="$DB_PASSWORD" \
                --single-transaction \
                --quick \
                --lock-tables=false \
                --routines \
                --triggers \
                --events \
                "$DB_NAME" | gzip > "/tmp/$BACKUP_FILE"
              
              # Verify backup file exists and is not empty
              if [ ! -s "/tmp/$BACKUP_FILE" ]; then
                echo "ERROR: Backup file is empty or does not exist"
                exit 1
              fi
              
              BACKUP_SIZE=$(du -h "/tmp/$BACKUP_FILE" | cut -f1)
              echo "Backup created: $BACKUP_FILE (size: $BACKUP_SIZE)"
              
              # Upload to S3
              apt-get update && apt-get install -y awscli
              
              aws s3 cp "/tmp/$BACKUP_FILE" \
                "s3://${S3_BUCKET}/mysql/${BACKUP_FILE}" \
                --region "$AWS_REGION"
              
              echo "Backup uploaded to S3: s3://${S3_BUCKET}/mysql/${BACKUP_FILE}"
              
              # Verify upload
              aws s3 ls "s3://${S3_BUCKET}/mysql/${BACKUP_FILE}" --region "$AWS_REGION"
              
              # Delete local backup
              rm "/tmp/$BACKUP_FILE"
              
              # Delete backups older than 30 days
              echo "Cleaning up old backups..."
              CUTOFF_DATE=$(date -d '30 days ago' +%Y%m%d)
              aws s3 ls "s3://${S3_BUCKET}/mysql/" --region "$AWS_REGION" | \
                while read -r line; do
                  FILE_DATE=$(echo $line | awk '{print $4}' | grep -oP 'backup-\K[0-9]{8}')
                  if [ ! -z "$FILE_DATE" ] && [ "$FILE_DATE" -lt "$CUTOFF_DATE" ]; then
                    FILE_NAME=$(echo $line | awk '{print $4}')
                    echo "Deleting old backup: $FILE_NAME"
                    aws s3 rm "s3://${S3_BUCKET}/mysql/${FILE_NAME}" --region "$AWS_REGION"
                  fi
                done
              
              echo "Backup completed successfully at $(date)"
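A cheap integrity check worth slotting between the dump and the upload: `gzip -t` validates the archive without decompressing it to disk, so a truncated dump fails the job instead of becoming the backup of record. A standalone sketch with a throwaway file:

```shell
#!/bin/sh
# Create a stand-in archive, then verify it the way the CronJob could
printf 'CREATE TABLE demo (id INT);\n' | gzip > /tmp/backup-demo.sql.gz

if gzip -t /tmp/backup-demo.sql.gz; then
    echo "archive OK"
else
    echo "archive CORRUPT" >&2
    exit 1
fi
rm -f /tmp/backup-demo.sql.gz
```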

Disaster recovery restoration script:

#!/bin/bash
# scripts/restore-from-backup.sh
#
# Restore database from S3 backup
# 
# Usage: ./scripts/restore-from-backup.sh <backup-file-name>
# Example: ./scripts/restore-from-backup.sh backup-20240315-120000.sql.gz

set -e

BACKUP_FILE=$1

if [[ -z "$BACKUP_FILE" ]]; then
    echo "Usage: $0 <backup-file-name>"
    echo ""
    echo "Available backups:"
    aws s3 ls s3://ecommerce-backups/mysql/ | grep backup- | awk '{print $4}'
    exit 1
fi

echo "=========================================="
echo "DATABASE DISASTER RECOVERY"
echo "=========================================="
echo ""
echo "WARNING: This will REPLACE the current database with backup from:"
echo "  $BACKUP_FILE"
echo ""
read -p "Are you absolutely sure? Type 'RESTORE' to continue: " CONFIRM

if [[ "$CONFIRM" != "RESTORE" ]]; then
    echo "Restoration cancelled"
    exit 1
fi

# Download backup from S3
echo "Downloading backup from S3..."
aws s3 cp "s3://ecommerce-backups/mysql/$BACKUP_FILE" "/tmp/$BACKUP_FILE"

# Verify download
if [ ! -f "/tmp/$BACKUP_FILE" ]; then
    echo "ERROR: Failed to download backup file"
    exit 1
fi

echo "Backup downloaded: $(du -h /tmp/$BACKUP_FILE | cut -f1)"

# Get database credentials
DB_POD=$(kubectl get pod -n ecommerce -l app=mysql -o jsonpath='{.items[0].metadata.name}')
DB_HOST=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-host}" | base64 -d)
DB_USER=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-user}" | base64 -d)
DB_PASS=$(kubectl get secret laravel-secrets -n ecommerce -o jsonpath="{.data.db-password}" | base64 -d)
DB_NAME="ecommerce"

# Create backup of current database before restoration
echo "Creating safety backup of current database..."
kubectl exec -n ecommerce $DB_POD -- mysqldump \
    -u"$DB_USER" -p"$DB_PASS" "$DB_NAME" | gzip > "/tmp/pre-restore-backup-$(date +%Y%m%d-%H%M%S).sql.gz"

# Copy backup file to database pod
echo "Copying backup to database pod..."
kubectl cp "/tmp/$BACKUP_FILE" "ecommerce/$DB_POD:/tmp/$BACKUP_FILE"

# Restore database
echo "Restoring database..."
kubectl exec -n ecommerce $DB_POD -- bash -c "
    set -e
    echo 'Dropping existing database...'
    mysql -u$DB_USER -p$DB_PASS -e 'DROP DATABASE IF EXISTS $DB_NAME'
    
    echo 'Creating fresh database...'
    mysql -u$DB_USER -p$DB_PASS -e 'CREATE DATABASE $DB_NAME CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci'
    
    echo 'Restoring from backup...'
    gunzip < /tmp/$BACKUP_FILE | mysql -u$DB_USER -p$DB_PASS $DB_NAME
    
    echo 'Restoration complete'
    rm /tmp/$BACKUP_FILE
"

# Verify restoration
echo "Verifying restoration..."
# -N suppresses the column header so wc -l counts only table rows
TABLE_COUNT=$(kubectl exec -n ecommerce $DB_POD -- mysql -N -u"$DB_USER" -p"$DB_PASS" "$DB_NAME" -e "SHOW TABLES" | wc -l)

echo "Restoration complete!"
echo "  Tables in database: $TABLE_COUNT"
echo ""
echo "Next steps:"
echo "1. Verify application functionality"
echo "2. Run: kubectl rollout restart deployment/laravel-blue -n ecommerce"
echo "3. Monitor error rates in Grafana"
echo ""
echo "Pre-restoration backup saved to: /tmp/pre-restore-backup-*.sql.gz"

Lessons Learned from Production Outages

Real failures teach better than any tutorial. Here are expensive lessons from production incidents.

Lesson 1: Connection Pool Exhaustion (Black Friday 2023)

What happened: At 9 PM on Black Friday, response times spiked to 45 seconds. Error rate hit 23%. Database connections maxed out at 100, but application needed 300+.

Root cause: PHP-FPM doesn't pool database connections at all; each worker opens and holds its own. Total connections therefore scale as pods × workers per pod, which blew straight past MySQL's max_connections limit of 100.

The fix:

<?php
// config/database.php - Proper connection management

'mysql' => [
    'driver' => 'mysql',
    'host' => env('DB_HOST', '127.0.0.1'),
    'port' => env('DB_PORT', '3306'),
    'database' => env('DB_DATABASE', 'forge'),
    'username' => env('DB_USERNAME', 'forge'),
    'password' => env('DB_PASSWORD', ''),
    'unix_socket' => env('DB_SOCKET', ''),
    'charset' => 'utf8mb4',
    'collation' => 'utf8mb4_unicode_ci',
    'prefix' => '',
    'prefix_indexes' => true,
    'strict' => true,
    'engine' => null,
    'options' => extension_loaded('pdo_mysql') ? array_filter([
        // Critical: Set connection timeout
        PDO::ATTR_TIMEOUT => 3,
        
        // Persistent connections reuse TCP connections across requests
        // (opt-in via DB_PERSISTENT; weigh against connection-count growth)
        PDO::ATTR_PERSISTENT => env('DB_PERSISTENT', false),
        
        // Set reasonable wait_timeout on MySQL side
        PDO::MYSQL_ATTR_INIT_COMMAND => "SET SESSION wait_timeout=600",
    ]) : [],
    
    // Pool settings - honored by connection-pooling runtimes such as
    // Laravel Octane (Swoole); plain PHP-FPM ignores this block
    'pool' => [
        'min_connections' => env('DB_POOL_MIN', 2),
        'max_connections' => env('DB_POOL_MAX', 10),
        'connect_timeout' => 3.0,
        'wait_timeout' => 600,
        'idle_timeout' => 60,
        'max_lifetime' => 3600,
    ],
],
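The arithmetic behind the outage is worth making explicit: total connection demand is roughly pods × PHP-FPM workers per pod (times connections per worker, if above one). With illustrative numbers matching the incident:

```shell
#!/bin/sh
# Peak connection demand vs. the MySQL max_connections limit
pods=10; fpm_workers=30; max_connections=100

needed=$((pods * fpm_workers))
echo "needed: $needed, limit: $max_connections"

if [ "$needed" -gt "$max_connections" ]; then
    echo "over limit by $((needed - max_connections)) connections"
fi
# needed: 300, limit: 100
# over limit by 200 connections
```

Either raise max_connections to cover the worst case, cap pm.max_children per pod, or put a connection proxy in front to multiplex; the config above only makes each connection cheaper, not fewer.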

Monitor connection usage:

<?php
// app/Console/Commands/MonitorDatabaseConnections.php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;

class MonitorDatabaseConnections extends Command
{
    protected $signature = 'db:monitor-connections';
    protected $description = 'Monitor database connection usage';
    
    public function handle()
    {
        while (true) {
            $connections = DB::select("SHOW PROCESSLIST");
            $active = count(array_filter($connections, fn($c) => $c->Command !== 'Sleep'));
            $total = count($connections);
            
            $this->info(sprintf(
                '[%s] Connections: %d active / %d total',
                now()->toDateTimeString(),
                $active,
                $total
            ));
            
            // Alert if nearing limit (80 assumes max_connections=100;
            // tune to ~80% of your server's actual max_connections)
            if ($total > 80) {
                $this->error("WARNING: High connection count!");
            }
            
            sleep(5);
        }
    }
}

Lesson 2: Cache Stampede During Deployment

What happened: After deploying new code, cache was cleared. 10,000+ requests simultaneously tried to rebuild the same cached product catalog. Database CPU hit 100%, response times went to 30+ seconds.

The fix - Cache warming strategy:

<?php
// app/Console/Commands/WarmCache.php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\Cache;
use App\Models\Product;

class WarmCache extends Command
{
    protected $signature = 'cache:warm';
    protected $description = 'Pre-warm critical caches before deployment';
    
    public function handle()
    {
        $this->info('Warming critical caches...');
        
        // Warm product catalog cache
        $this->warmProductCatalog();
        
        // Warm category tree
        $this->warmCategoryTree();
        
        // Warm popular searches
        $this->warmPopularSearches();
        
        $this->info('Cache warming complete!');
    }
    
    private function warmProductCatalog(): void
    {
        $this->info('  Warming product catalog...');
        
        Cache::remember('products:featured', 3600, function() {
            return Product::where('featured', true)
                ->with(['images', 'categories'])
                ->get();
        });
        
        Cache::remember('products:bestsellers', 3600, function() {
            return Product::orderBy('sales_count', 'desc')
                ->limit(50)
                ->with(['images', 'categories'])
                ->get();
        });
    }
    
    private function warmCategoryTree(): void
    {
        $this->info('  Warming category tree...');
        
        Cache::remember('categories:tree', 3600, function() {
            return \App\Models\Category::with('children')->whereNull('parent_id')->get();
        });
    }
    
    private function warmPopularSearches(): void
    {
        $this->info('  Warming popular search queries...');
        
        $popular_queries = ['laptop', 'phone', 'headphones', 'camera'];
        
        foreach ($popular_queries as $query) {
            Cache::remember("search:$query", 1800, function() use ($query) {
                return Product::search($query)->take(20)->get();
            });
        }
    }
}

Run cache warming BEFORE traffic switches:

# In your deployment script, before switching blue-green
kubectl exec -n ecommerce deployment/laravel-green -- php artisan cache:warm

# Then switch traffic
kubectl patch service laravel-service -n ecommerce \
    -p '{"spec":{"selector":{"environment":"green"}}}'
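Cache warming shrinks the stampede window, but a key can still go cold under live traffic. Laravel's atomic locks (`Cache::lock`) let exactly one request rebuild the value while concurrent requests briefly wait and then re-read the cache. A minimal sketch, assuming a lock-capable driver such as Redis; `rebuildCatalog()` is a hypothetical stand-in for the expensive query:

```php
<?php
// Stampede-safe read using Laravel's atomic locks. Assumes the Redis
// (or another lock-capable) cache driver; rebuildCatalog() is a
// hypothetical helper standing in for the expensive rebuild.

use Illuminate\Support\Facades\Cache;

function getCatalog(): mixed
{
    // Fast path: serve the cached value if present.
    if (($cached = Cache::get('products:catalog')) !== null) {
        return $cached;
    }

    // Only one request may rebuild; the rest block up to 5 seconds
    // waiting for the lock (LockTimeoutException on timeout).
    return Cache::lock('rebuild:products:catalog', 10)->block(5, function () {
        // Double-check: another request may have already rebuilt the
        // value while we waited for the lock.
        return Cache::remember('products:catalog', 3600, fn () => rebuildCatalog());
    });
}
```

Combined with pre-deployment warming, this bounds the worst case to a single rebuild per cold key instead of thousands.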

Lesson 3: Silent Data Corruption from Race Condition

What happened: A customer reported being double-charged. Investigation found a race condition in payment processing: two requests for the same order ID both succeeded because the database existence check happened before the insert.

The fix - Idempotency keys and database constraints:

<?php
// database/migrations/2024_03_15_add_idempotency_to_payments.php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up()
    {
        Schema::table('payments', function (Blueprint $table) {
            // Add idempotency key column
            $table->string('idempotency_key', 64)->nullable()->after('id');
            
            // Create unique index to prevent duplicate processing
            $table->unique('idempotency_key', 'payments_idempotency_key_unique');
            
            // Add index for fast lookups
            $table->index(['order_id', 'status'], 'payments_order_status_index');
        });
        
        // Ensure only one successful payment per order. MySQL has no
        // partial indexes, so emulate one with a generated column that
        // is non-NULL only for successful payments; unique indexes
        // allow multiple NULLs, so other statuses are unaffected.
        DB::statement("
            ALTER TABLE payments
            ADD COLUMN success_order_id BIGINT UNSIGNED
                GENERATED ALWAYS AS (IF(status = 'success', order_id, NULL)) STORED,
            ADD UNIQUE INDEX payments_order_success_unique (success_order_id)
        ");
    }
    
    public function down()
    {
        // Dropping the generated column also drops its unique index
        DB::statement('ALTER TABLE payments DROP COLUMN success_order_id');
        
        Schema::table('payments', function (Blueprint $table) {
            $table->dropUnique('payments_idempotency_key_unique');
            $table->dropIndex('payments_order_status_index');
            $table->dropColumn('idempotency_key');
        });
    }
};

<?php
// app/Services/PaymentService.php - Idempotent payment processing

namespace App\Services;

use App\Models\Payment;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Str;

class PaymentService
{
    public function processPayment(string $order_id, float $amount, ?string $idempotency_key = null): Payment
    {
        // Generate idempotency key if not provided
        $idempotency_key = $idempotency_key ?? Str::uuid()->toString();
        
        // Check if this payment was already processed
        $existing = Payment::where('idempotency_key', $idempotency_key)->first();
        if ($existing) {
            \Log::info('Payment already processed, returning existing', [
                'idempotency_key' => $idempotency_key,
                'payment_id' => $existing->id,
            ]);
            return $existing;
        }
        
        // Use database transaction with explicit locking
        return DB::transaction(function() use ($order_id, $amount, $idempotency_key) {
            // Check if order already has successful payment (with row lock)
            $existing_success = Payment::where('order_id', $order_id)
                ->where('status', 'success')
                ->lockForUpdate()  // Critical: Lock row to prevent race condition
                ->first();
            
            if ($existing_success) {
                throw new \Exception("Order {$order_id} already has successful payment");
            }
            
            // Create payment record FIRST (before calling Stripe)
            // This ensures we have database constraint protection
            try {
                $payment = Payment::create([
                    'idempotency_key' => $idempotency_key,
                    'order_id' => $order_id,
                    'amount' => $amount,
                    'status' => 'pending',
                ]);
            } catch (\Illuminate\Database\QueryException $e) {
                // Duplicate idempotency key - payment already processing
                if ($e->getCode() === '23000') {  // Integrity constraint violation
                    \Log::warning('Duplicate payment attempt blocked', [
                        'order_id' => $order_id,
                        'idempotency_key' => $idempotency_key,
                    ]);
                    
                    // Return existing payment
                    return Payment::where('idempotency_key', $idempotency_key)->first();
                }
                throw $e;
            }
            
            // Now process with Stripe
            try {
                $stripe_payment = \Stripe\PaymentIntent::create([
                    'amount' => (int) round($amount * 100),  // Stripe expects integer cents
                    'currency' => 'usd',
                    'metadata' => [
                        'order_id' => $order_id,
                        'payment_id' => $payment->id,
                    ],
                ], [
                    'idempotency_key' => $idempotency_key,  // Stripe also supports idempotency
                ]);
                
                // NOTE: the intent was created, but a full flow still
                // requires client-side confirmation before funds settle.
                $payment->update([
                    'stripe_payment_intent_id' => $stripe_payment->id,
                    'status' => 'success',
                    'completed_at' => now(),
                ]);
                
                return $payment;
                
            } catch (\Exception $e) {
                $payment->update([
                    'status' => 'failed',
                    'error_message' => $e->getMessage(),
                ]);
                throw $e;
            }
        }, 5);  // 5 retry attempts for deadlocks
    }
}
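For idempotency to protect against client retries, the key should come from the client so a retried request maps to the same payment. A sketch of a hypothetical checkout controller (the route and controller names are assumptions) that passes an `Idempotency-Key` header through to the service above:

```php
<?php
// Hypothetical checkout controller: the client supplies the
// Idempotency-Key header, so a retried request resolves to the same
// payment record instead of creating a new charge.

namespace App\Http\Controllers;

use App\Services\PaymentService;
use Illuminate\Http\Request;

class CheckoutController extends Controller
{
    public function charge(Request $request, PaymentService $payments)
    {
        $payment = $payments->processPayment(
            order_id: $request->input('order_id'),
            amount: (float) $request->input('amount'),
            // Reuse the client's key; the service generates one if absent.
            idempotency_key: $request->header('Idempotency-Key'),
        );

        return response()->json([
            'payment_id' => $payment->id,
            'status' => $payment->status,
        ]);
    }
}
```

If the header is omitted, the server-generated key still deduplicates within a single request's transaction, but only a client-supplied key survives a network-level retry.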

Beyond This Series: Advanced Topics

We've built a production-grade e-commerce platform. Here's what comes next as you scale.

1. Multi-Region Deployment

When your platform grows globally, single-region deployment isn't enough:

  • Latency: US users experience 200ms+ latency to EU-hosted services
  • Compliance: GDPR requires EU data stays in EU
  • Availability: Regional AWS outages happen (see the us-east-1 outage of December 2021)

Next steps:

  • Set up Kubernetes clusters in multiple regions (us-east-1, eu-west-1, ap-southeast-1)
  • Implement global load balancing with Route53 or Cloudflare
  • Use MySQL read replicas or a multi-region database like Amazon Aurora Global Database
  • Implement distributed caching with Redis Cluster across regions
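The global load balancing step can be sketched with Route 53 latency-based routing: one record per region sharing a name, and Route 53 answers with the region closest (by measured latency) to the resolver. The hosted zone ID and IP below are hypothetical placeholders:

```shell
# Latency-based routing record for the us-east-1 cluster; repeat with a
# different SetIdentifier/Region/IP for each additional cluster.
cat > latency-us-east-1.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "shop.example.com",
      "Type": "A",
      "SetIdentifier": "us-east-1",
      "Region": "us-east-1",
      "TTL": 60,
      "ResourceRecords": [{"Value": "203.0.113.10"}]
    }
  }]
}
EOF

# Apply against your hosted zone (placeholder zone ID):
# aws route53 change-resource-record-sets \
#     --hosted-zone-id Z0000000EXAMPLE \
#     --change-batch file://latency-us-east-1.json
```

Health checks can be attached per record so a failed region is removed from answers automatically.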

2. Advanced Observability

The production monitoring we've covered is foundational. Advanced observability includes:

  • Distributed tracing with context propagation across microservices
  • Real User Monitoring (RUM) to measure actual user experience
  • Synthetic monitoring to catch issues before users do
  • Cost attribution per feature/customer using Kubecost

Tools to explore:

  • OpenTelemetry for standardized observability
  • Honeycomb.io for high-cardinality observability
  • Lightstep for service mesh observability

3. Advanced Security Hardening

Security is never finished:

  • Runtime security with Falco: Detect unexpected behavior in containers
  • Image scanning: Integrate Trivy or Snyk into CI/CD
  • Network policies: Restrict pod-to-pod communication
  • Secrets management: Migrate from Kubernetes Secrets to HashiCorp Vault
  • mTLS between services: Use service mesh like Istio or Linkerd
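The network-policies bullet can be made concrete with a standard Kubernetes NetworkPolicy: only pods labeled `app: laravel` may reach the database pods on port 3306. The `app: mysql` label is an assumption; adjust selectors to match your manifests:

```shell
# Restrict pod-to-pod traffic: ingress to the MySQL pods is allowed
# only from Laravel pods, on the MySQL port.
cat > db-network-policy.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-laravel-to-mysql
  namespace: ecommerce
spec:
  podSelector:
    matchLabels:
      app: mysql
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: laravel
      ports:
        - protocol: TCP
          port: 3306
EOF

# kubectl apply -f db-network-policy.yaml
```

Note that NetworkPolicy is deny-by-default only for pods selected by at least one policy, and enforcement requires a CNI plugin that supports it (Calico, Cilium, and similar).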

4. Cost Optimization at Scale

At scale, cost optimization becomes critical:

  • Reserved instances / Savings Plans: 40-60% savings for predictable workloads
  • Spot instances for stateless workloads: 70-90% savings for batch jobs
  • Database query optimization: Poorly optimized queries cost thousands monthly
  • CDN usage: Offload static assets to reduce compute costs
  • Right-sizing based on actual usage patterns

5. Chaos Engineering

Test your disaster recovery before disaster happens:

  • Pod deletion: Random pod failures using Chaos Mesh
  • Network latency injection: Simulate slow connections
  • Resource exhaustion: Test behavior under memory/CPU pressure
  • Dependency failures: Inject failures in Redis, database, external APIs
  • Game Days: Scheduled failure testing with full team participation

Start here:

# Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash

# Create a simple chaos experiment
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
  namespace: ecommerce
spec:
  action: pod-failure
  mode: one
  duration: "60s"
  selector:
    namespaces:
      - ecommerce
    labelSelectors:
      app: laravel
EOF

Final Thoughts

We've covered eight parts and built a production-grade e-commerce platform:

Part 1: Domain-Driven Design foundation
Part 2: Stripe payment integration with webhooks
Part 3: Event-driven architecture with RabbitMQ
Part 4: Background job processing and queues
Part 5: Docker containerization
Part 6: Kubernetes orchestration
Part 7: CI/CD pipeline automation
Part 8: Production deployment and operations

What makes this production-grade:

  • ✅ Comprehensive health checks and observability
  • ✅ Blue-green deployment for zero-downtime releases
  • ✅ Automated backups with tested restore procedures
  • ✅ Incident response runbooks for 2 AM outages
  • ✅ SLO tracking and error budget monitoring
  • ✅ Cost optimization and autoscaling
  • ✅ Lessons learned from real production failures

Your platform is now production-ready, but remember: production readiness is not a destination. It's continuous improvement based on real operational data.

Keep learning:

  • Monitor your SLOs religiously
  • Conduct post-mortems after every incident
  • Test disaster recovery quarterly
  • Review costs monthly
  • Update runbooks as systems evolve

The code is at https://github.com/iBekzod/laravel-ecommerce — production-tested patterns you can use today.

Questions or war stories from your production deployments? I'd love to hear them at https://nextgenbeing.com.

Now go build something that scales. 🚀

Daniel Hartwell

Author

Senior backend engineer focused on distributed systems and database performance. Previously at fintech and SaaS scale-ups. Writes about the boring-but-critical infrastructure that keeps systems running.
