Docker Best Practices: Production Lessons from 50M Requests/Day - NextGenBeing

Best Practices for Containerization with Docker: What We Learned Scaling to 50M Requests

After scaling our containerized infrastructure from 100k to 50M daily requests, we discovered Docker best practices the hard way. Here's what actually works in production.

Growth & Distribution · 16 min read · 4,487 words

NextGenBeing

Apr 21, 2026


Last August, our team hit a wall. We'd been running our containerized microservices architecture smoothly at around 100k requests per day. Then we got featured on Product Hunt, and traffic exploded to 2M requests within 48 hours. Our Docker containers started failing in ways we'd never seen during development. Image pulls timed out. Builds that took 3 minutes locally took 45 minutes in CI/CD. Memory usage spiked unpredictably. Our container orchestration layer couldn't scale fast enough.

I spent the next six months rebuilding our entire containerization strategy from the ground up. We went from amateur Docker usage to a production-grade setup that now handles 50M+ requests daily across 200+ containers. The journey taught me that most Docker tutorials and guides focus on getting containers running, not on making them production-ready at scale.

Here's what I wish someone had told me before we hit that wall. This isn't theory from documentation—it's battle-tested patterns from real production failures and the solutions that actually worked.

The Image Size Problem Nobody Talks About

When we started with Docker, our images were massive. Our main Node.js API image was 1.2GB. Our Python data processing service was 980MB. We didn't think much of it until we tried to scale horizontally during that traffic spike.

Here's what happened: When Kubernetes tried to spin up 50 new pods to handle the load, each node had to pull these massive images. At 1.2GB per image with 10 pods per node, we were transferring 12GB per node. Our container registry couldn't handle the bandwidth. Pods took 8-12 minutes just to start because they were waiting for image pulls. By the time they were ready, the traffic spike had either crashed our existing pods or moved on.

I learned that image size isn't just about storage—it's about deployment velocity. Every megabyte in your image multiplies across every container instance, every deployment, every scale-up event. When you're trying to auto-scale during a traffic spike, those seconds matter.

Multi-Stage Builds: The Game Changer

The first thing I did was rewrite every Dockerfile using multi-stage builds. Here's our original Node.js Dockerfile:

FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["node", "dist/index.js"]

Simple, right? But this pulled in the full Node.js image (900MB), kept all the development dependencies, and included source files we didn't need in production. Our final image was 1.2GB.

Here's what we use now:

# Build stage
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
# Build needs devDependencies; prune them afterward so only
# production modules are copied into the final stage
RUN npm run build && \
    npm prune --omit=dev && \
    npm cache clean --force

# Production stage
FROM node:18-alpine AS production
WORKDIR /app
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package*.json ./
USER nodejs
EXPOSE 3000
CMD ["node", "dist/index.js"]

This dropped our image from 1.2GB to 180MB. Let me break down what made the difference:

Alpine base images: Switching from node:18 (900MB) to node:18-alpine (170MB) cut 730MB immediately. Alpine is a minimal Linux distribution designed for containers. We use Alpine for everything now except when we absolutely need glibc compatibility.

Separate build and production stages: The builder stage has all our build tools and source code. The production stage only copies the compiled artifacts and production dependencies. All the TypeScript source, dev dependencies, and build tools stay in the builder stage and never make it to the final image.

npm ci instead of npm install: This was a subtle but important change. npm ci is designed for CI/CD environments. It's faster, more reliable, and crucially, it only installs what's in package-lock.json. No surprises.

Cache cleaning: That npm cache clean --force removes npm's cache directory, saving another 50-100MB depending on your dependencies.

Here's the output from building this new Dockerfile:

$ docker build -t api:optimized .
[+] Building 127.3s (16/16) FINISHED
 => [builder 1/6] FROM node:18-alpine                     3.2s
 => [builder 2/6] WORKDIR /app                            0.1s
 => [builder 3/6] COPY package*.json ./                   0.1s
 => [builder 4/6] RUN npm ci                             89.4s
 => [builder 5/6] COPY . .                                0.3s
 => [builder 6/6] RUN npm run build && npm prune --o     28.1s
 => [production 1/5] FROM node:18-alpine                  0.0s
 => [production 2/5] RUN addgroup -g 1001 -S nodejs       0.4s
 => [production 3/5] COPY --from=builder /app/dist        0.2s
 => [production 4/5] COPY --from=builder /app/node_mod    0.8s
 => [production 5/5] COPY --from=builder /app/package*    0.1s
 => exporting to image                                    4.7s
 => => exporting layers                                   4.6s
 => => writing image sha256:a3f9b2...                     0.0s

$ docker images | grep api
api    optimized    a3f9b2c8d1e4    2 minutes ago    180MB
api    old          f8e2a1b9c3d7    1 week ago       1.2GB

That 180MB versus 1.2GB difference transformed our deployment velocity. When we needed to scale up 20 pods during a traffic spike, we went from 8-12 minutes to under 90 seconds. The math is simple: 20 pods × 180MB = 3.6GB versus 20 pods × 1.2GB = 24GB of image data to transfer.
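Our Python data processing service got the same treatment. Here's a sketch of the pattern we applied (the entrypoint, requirements file, and base tag are illustrative, not our exact setup):

```dockerfile
# Build stage: install dependencies into an isolated virtualenv
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt ./
RUN python -m venv /opt/venv && \
    /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

# Production stage: only the virtualenv and application code come along
FROM python:3.11-slim AS production
WORKDIR /app
ENV PATH="/opt/venv/bin:$PATH"
COPY --from=builder /opt/venv /opt/venv
COPY . .
CMD ["python", "main.py"]
```

Note the slim base rather than Alpine: many Python wheels assume glibc, which is exactly the compatibility caveat mentioned above.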

Layer Caching: The Secret to Fast Builds

But image size was only half the battle. Our CI/CD builds were taking forever—45 minutes for a simple code change. The problem was layer caching, or rather, our complete lack of understanding of how it worked.

Docker builds images in layers. Each instruction in your Dockerfile creates a new layer. Docker caches these layers and reuses them if nothing has changed. The key word is "if nothing has changed." If a layer changes, Docker invalidates the cache for that layer and all subsequent layers.

Here's where we were shooting ourselves in the foot:

# BAD: This invalidates cache on every build
FROM node:18-alpine
WORKDIR /app
COPY . .                    # Copies EVERYTHING, including package.json
RUN npm ci                  # Cache invalidated every time code changes
RUN npm run build

Every time we changed a single line of code, Docker had to reinstall all our dependencies because we'd copied the entire codebase before running npm ci. With 200+ npm packages, that was 3-4 minutes of unnecessary work on every build.

The fix is to order your Dockerfile instructions from least frequently changed to most frequently changed:

# GOOD: Cache-optimized layer ordering
FROM node:18-alpine
WORKDIR /app

# Copy only dependency files first
COPY package*.json ./

# Install dependencies (cached unless package.json changes)
RUN npm ci

# Copy source code last (changes frequently)
COPY . .

# Build (only runs if source changes)
RUN npm run build

Now when we change application code, Docker reuses the cached dependency layer. Our build times dropped from 45 minutes to 3-5 minutes for code changes. Only when we update dependencies does it take the full time.

Here's the build output showing cache hits:

$ docker build -t api:v2 .
[+] Building 4.2s (12/12) FINISHED
 => [1/6] FROM node:18-alpine                    CACHED
 => [2/6] WORKDIR /app                           CACHED
 => [3/6] COPY package*.json ./                  CACHED
 => [4/6] RUN npm ci                             CACHED
 => [5/6] COPY . .                               0.3s
 => [6/6] RUN npm run build                      3.1s

See those "CACHED" markers? That's Docker reusing previous layers. Only the source copy and build steps ran, saving us 89 seconds of dependency installation.
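One refinement we've since adopted on top of layer ordering: BuildKit cache mounts keep npm's download cache on the build host, so even when package-lock.json changes, packages aren't re-downloaded from scratch. A sketch; requires BuildKit (DOCKER_BUILDKIT=1 or any recent Docker):

```dockerfile
# syntax=docker/dockerfile:1
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
# The mounted cache persists across builds but never enters the image
RUN --mount=type=cache,target=/root/.npm \
    npm ci
COPY . .
RUN npm run build
```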

The .dockerignore File You're Probably Missing

Another rookie mistake we made: we weren't using .dockerignore files. This meant every COPY . . instruction was copying our entire project directory, including node_modules, .git, test files, documentation, and local environment files.

This caused three problems:

  1. Slow context transfer: Docker has to send your entire build context to the Docker daemon before building. Our context was 400MB because it included node_modules and .git history. This took 15-20 seconds just to start the build.

  2. Cache invalidation: Copying unnecessary files meant our cache was invalidated by changes to files that didn't matter (like README updates or test file changes).

  3. Security risks: We were accidentally copying .env files, SSH keys, and other sensitive data into our images.

Here's the .dockerignore file we now use for every project:

# .dockerignore
node_modules
npm-debug.log
.git
.gitignore
.env
.env.*
*.md
LICENSE
.vscode
.idea
coverage
.nyc_output
dist
build
*.log
.DS_Store
Thumbs.db
*.swp
*.swo
*~
.pytest_cache
__pycache__
*.pyc
.coverage
htmlcov
.tox
.mypy_cache

After adding this, our build context dropped from 400MB to 12MB. Context transfer time went from 15-20 seconds to under 1 second. More importantly, we stopped invalidating cache when we updated documentation or test files.

Here's the before and after:

# Before .dockerignore
$ docker build -t api:v1 .
[+] Building 19.1s (2/2) FINISHED
 => [internal] load build context                        18.4s
 => => transferring context: 421.34MB                    18.3s

# After .dockerignore  
$ docker build -t api:v2 .
[+] Building 1.0s (2/2) FINISHED
 => [internal] load build context                         0.8s
 => => transferring context: 12.48MB                      0.7s

That 18-second difference happens on every build. With 50+ builds per day across our team, we were wasting 15 minutes of build time daily just transferring unnecessary files.

Security: The Production Reality Check

Three months after we launched, we got a security audit from a potential enterprise customer. They found 47 vulnerabilities in our container images. Not in our code—in our base images and dependencies. Some were critical. We almost lost the deal.

The problem was that we were using latest tags and never updating our base images. Our Dockerfiles looked like this:

FROM node:latest

That latest tag was actually pointing to an image that was 6 months old with known security vulnerabilities. We thought "latest" meant "most recent," but it actually means "whatever the image maintainer tagged as latest," which might not be updated frequently.

Base Image Selection and Versioning

Here's what we do now for every Dockerfile:

# Pin to specific version with SHA256 digest
FROM node:18.19.0-alpine3.19@sha256:435dcad253bb5b7f347ebc69c8cc52de7c912eb7241098b920f2fc2d7843183d AS builder

This pins not just to a specific version (18.19.0) but to the exact image digest. Even if someone compromises the registry and pushes a malicious image with the same tag, our builds will fail because the digest won't match.
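To find the digest to pin, pull the tag once and read it back from the local image. The digest shown here is the one pinned above:

```
$ docker pull node:18.19.0-alpine3.19
$ docker inspect --format='{{index .RepoDigests 0}}' node:18.19.0-alpine3.19
node@sha256:435dcad253bb5b7f347ebc69c8cc52de7c912eb7241098b920f2fc2d7843183d
```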

We maintain a spreadsheet tracking base image versions across all our services. Every month, we review and update them. Here's our process:

  1. Check for security advisories on the base images we use
  2. Test updated images in staging
  3. Roll out updates service by service
  4. Document any breaking changes

We also scan our images with Trivy before pushing to production:

$ trivy image api:v2.1.0
api:v2.1.0 (alpine 3.19.0)
===========================
Total: 2 (UNKNOWN: 0, LOW: 2, MEDIUM: 0, HIGH: 0, CRITICAL: 0)

┌───────────────┬────────────────┬──────────┬───────────────────┬───────────────────┬────────────────────────────────┐
│   Library     │ Vulnerability  │ Severity │ Installed Version │  Fixed Version    │            Title               │
├───────────────┼────────────────┼──────────┼───────────────────┼───────────────────┼────────────────────────────────┤
│ libcrypto3    │ CVE-2024-0727  │ LOW      │ 3.1.4-r0          │ 3.1.4-r1          │ openssl: denial of service via │
│               │                │          │                   │                   │ null dereference               │
└───────────────┴────────────────┴──────────┴───────────────────┴───────────────────┴────────────────────────────────┘

Routine Trivy scans caught a critical vulnerability in one of our images last month before we deployed to production. The fix was simple—update the base image to the patched version—but catching it early saved us from a potential security incident.

Running as Non-Root: The Principle of Least Privilege

By default, Docker containers run as root. This is a massive security risk. If an attacker compromises your application and breaks out of the container, they have root access to the host system.

We learned this the hard way when a penetration tester got shell access to one of our containers through a command injection vulnerability in a legacy API endpoint. Because the container was running as root, they were able to:

  1. Read sensitive environment variables
  2. Access mounted volumes with production data
  3. Make network requests to internal services
  4. Attempt to escape the container

Fortunately, they were a friendly pen tester, not a real attacker. But it scared us into fixing our security posture immediately.

Now every Dockerfile creates and uses a non-root user:

FROM node:18-alpine

# Create app directory
WORKDIR /app

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

# Copy files with correct ownership
COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci --only=production
COPY --chown=nodejs:nodejs . .

# Switch to non-root user
USER nodejs

EXPOSE 3000
CMD ["node", "index.js"]

The key parts:

  1. addgroup and adduser: Creates a system group and user with specific IDs (1001). Using consistent IDs across containers helps with volume permissions.

  2. --chown flag: Sets ownership of copied files to our non-root user. Without this, files are owned by root and the nodejs user can't read them.

  3. USER directive: Switches to the non-root user for all subsequent commands and the final container runtime.
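A quick sanity check we run after building (the image tag is illustrative):

```
$ docker run --rm api:v2 whoami
nodejs
```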

We also configure our orchestration layer to enforce non-root execution:

# Kubernetes security contexts (some fields are pod-level, some container-level)
spec:
  securityContext:            # pod level
    runAsNonRoot: true
    runAsUser: 1001
    runAsGroup: 1001
    fsGroup: 1001
  containers:
  - name: api
    securityContext:          # container level
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
          - ALL

This belt-and-suspenders approach ensures that even if someone forgets to add USER to a Dockerfile, Kubernetes will refuse to run the container as root.

Read-Only Filesystems and Volume Management

Another security practice we adopted: read-only root filesystems. Most applications don't need to write to the filesystem except for specific directories like /tmp or log directories.

Here's our pattern:

FROM node:18-alpine
WORKDIR /app

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

# Create writable directories
RUN mkdir -p /app/logs /app/tmp && \
    chown -R nodejs:nodejs /app/logs /app/tmp

COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci --only=production
COPY --chown=nodejs:nodejs . .

USER nodejs

# Application writes logs here
VOLUME ["/app/logs"]

EXPOSE 3000
CMD ["node", "index.js"]

Then in our Kubernetes deployment:

spec:
  containers:
  - name: api
    image: api:v2.1.0
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: logs
      mountPath: /app/logs
    - name: tmp
      mountPath: /app/tmp
  volumes:
  - name: logs
    emptyDir: {}
  - name: tmp
    emptyDir: {}

This prevents attackers from modifying application files or installing malicious binaries even if they compromise the application. They can only write to the specific directories we've mounted as volumes.

We discovered this was important when we had a Redis container get compromised through an unpatched vulnerability. The attacker tried to download and execute a cryptocurrency miner. Because we had a read-only filesystem, the download failed. Our monitoring caught the failed attempts, and we patched the vulnerability before any real damage occurred.

Resource Management: The Scaling Nightmare

When we first hit 2M requests per day, our containers started behaving erratically. Some would consume 4GB of memory and get OOM killed. Others would max out CPU and slow to a crawl. We had no resource limits configured, so containers were competing for resources on the same node.

The worst incident happened at 3am on a Saturday. Our Python data processing service had a memory leak. Without resource limits, it consumed all 32GB of RAM on its node, causing the kernel OOM killer to start randomly killing processes. It killed our database proxy, which crashed our entire API layer. We were down for 45 minutes while I scrambled to restart everything.

Setting Realistic Resource Limits

Here's what we learned about resource limits the hard way:

Don't guess—measure first. We spent two weeks monitoring our containers in production with no limits, collecting metrics on actual resource usage. For each service, we tracked:

  • P50, P95, and P99 memory usage
  • P50, P95, and P99 CPU usage
  • Memory usage during startup
  • CPU usage during peak load
  • Resource usage during background jobs

For our Node.js API, the data looked like this:

Memory Usage:
  P50: 180MB
  P95: 320MB
  P99: 450MB
  Startup: 280MB

CPU Usage:
  P50: 0.15 cores
  P95: 0.8 cores
  P99: 1.2 cores
  Peak: 1.5 cores

Based on this data, we set resource requests and limits:

resources:
  requests:
    memory: "512Mi"    # Above P99 for normal operation
    cpu: "500m"        # Above P95 for normal operation
  limits:
    memory: "1Gi"      # 2x requests for headroom
    cpu: "2000m"       # Allows bursting during spikes

The key insight: requests are what you need, limits are your safety net. Requests guarantee resources. Limits prevent runaway processes from taking down the node.

We set requests based on P95-P99 usage so containers have enough resources under normal load. We set limits at 2x requests to allow for spikes while preventing catastrophic resource exhaustion.

Here's what happened when we applied these limits:

# Before limits - OOM killed during traffic spike
$ kubectl get pods
NAME                   READY   STATUS      RESTARTS   AGE
api-7d8f9b5c6d-x2k9p   0/1     OOMKilled   5          10m

# After limits - graceful handling
$ kubectl get pods  
NAME                   READY   STATUS    RESTARTS   AGE
api-7d8f9b5c6d-p8m3q   1/1     Running   0          2h

The container stays running, and Kubernetes can make intelligent scheduling decisions based on actual resource usage.
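Accurate requests also pay off for autoscaling: the Horizontal Pod Autoscaler measures CPU utilization as a percentage of the request, so a bad request value skews every scaling decision. A sketch of the kind of HPA we pair with a deployment like this; names and thresholds here are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the CPU request, not the limit
```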

CPU Limits: The Throttling Trap

Here's something that surprised me: CPU limits can actually hurt performance even when you're not hitting the limit. This is because of how the Linux kernel implements CPU quotas.

When you set a CPU limit, the kernel enforces it using CFS (Completely Fair Scheduler) quotas. The quota is checked every 100ms. If your container uses more than its quota in that period, it gets throttled for the rest of the period—even if the CPU is idle.

We discovered this when our API response times suddenly jumped from 50ms to 150ms after adding CPU limits. The containers weren't hitting their limits in aggregate, but they were experiencing micro-throttling.

Here's what the metrics showed:

Container: api-7d8f9b5c6d-p8m3q
CPU Usage: 850m (limit: 2000m)
Throttled Periods: 45,234
Total Periods: 120,000
Throttle Percentage: 37.7%

The container was being throttled 37% of the time despite using less than half its CPU limit. This happened because our API had bursty CPU usage—it would spike to 1.8 cores for 20ms to process a request, get throttled, then sit idle for 80ms.
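Those throttling counters come straight from the kernel's cgroup accounting, and you can read them from inside a running container. The path shown is for cgroup v2; on cgroup v1, look at /sys/fs/cgroup/cpu,cpuacct/cpu.stat (output abridged):

```
$ cat /sys/fs/cgroup/cpu.stat
nr_periods 120000
nr_throttled 45234
```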

The solution? We removed CPU limits entirely and only kept CPU requests:

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"       # Guarantees scheduling
  limits:
    memory: "1Gi"     # Caps a leak before it takes down the node
    # No CPU limit - allows bursting

This is controversial in the Kubernetes community, but it worked for us. Our response times dropped back to 50ms, and we haven't had CPU-related incidents. The key is to monitor CPU usage and set appropriate requests so Kubernetes schedules containers on nodes with sufficient capacity.

Memory Limits and the OOM Killer

Memory limits are different from CPU limits. You absolutely need them: without a limit, a single leaking process can exhaust the whole node, as our 3am incident proved. But setting them wrong is equally dangerous.

We learned this when our data processing service started getting OOM killed seemingly randomly. The logs showed:

[2024-01-15 14:23:45] Processing batch 1234...
[2024-01-15 14:23:47] Batch complete, processed 50000 records
[2024-01-15 14:23:48] Killed

No error message, no warning—just "Killed". The kernel OOM killer had terminated the process for exceeding its memory limit.

The problem was our limit was too low. We'd set it at 1GB based on average usage, but our batch processing had occasional spikes to 1.2GB when processing large datasets. The OOM killer doesn't care about averages—it kills processes the instant they exceed their limit.

We increased the limit to 2GB and added memory monitoring:

// Add to application code: warn as the V8 heap nears its ceiling
const v8 = require('v8');

setInterval(() => {
  const heapStats = v8.getHeapStatistics();
  const usedMB = Math.round(heapStats.used_heap_size / 1048576);
  const limitMB = Math.round(heapStats.heap_size_limit / 1048576);
  if (usedMB > limitMB * 0.85) {
    console.warn(`Heap at ${usedMB}MB of a ${limitMB}MB limit`);
  }
}, 30000);

The warning fires well before the ceiling, which gives us time to investigate instead of finding another bare "Killed" in the logs.
