NextGenBeing
Last August, our team hit a wall. We'd been running our containerized microservices architecture smoothly at around 100k requests per day. Then we got featured on Product Hunt, and traffic exploded to 2M requests within 48 hours. Our Docker containers started failing in ways we'd never seen during development. Image pulls timed out. Builds that took 3 minutes locally took 45 minutes in CI/CD. Memory usage spiked unpredictably. Our container orchestration layer couldn't scale fast enough.
I spent the next six months rebuilding our entire containerization strategy from the ground up. We went from amateur Docker usage to a production-grade setup that now handles 50M+ requests daily across 200+ containers. The journey taught me that most Docker tutorials and guides focus on getting containers running, not on making them production-ready at scale.
Here's what I wish someone had told me before we hit that wall. This isn't theory from documentation—it's battle-tested patterns from real production failures and the solutions that actually worked.
The Image Size Problem Nobody Talks About
When we started with Docker, our images were massive. Our main Node.js API image was 1.2GB. Our Python data processing service was 980MB. We didn't think much of it until we tried to scale horizontally during that traffic spike.
Here's what happened: when Kubernetes tried to spin up 50 new pods to handle the load, every node had to pull the images for the pods scheduled onto it. A node only pulls each image once, but with roughly ten distinct service images of around 1.2GB each landing on every node, that meant about 12GB of transfer per node. Our container registry couldn't handle the bandwidth. Pods took 8-12 minutes just to start because they were waiting for image pulls. By the time they were ready, the traffic spike had either crashed our existing pods or moved on.
I learned that image size isn't just about storage—it's about deployment velocity. Every megabyte in your image multiplies across every container instance, every deployment, every scale-up event. When you're trying to auto-scale during a traffic spike, those seconds matter.
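To make that concrete, here's a back-of-envelope sketch of the pull cost during a scale-up. The node count and registry bandwidth below are illustrative assumptions, not measured figures from our setup:

```javascript
// Rough image-pull cost for a scale-up, assuming every node pulls cold
// and the registry link is the bottleneck (both numbers are hypothetical).
function pullCost({ imageMB, nodes, registryMBps }) {
  const totalMB = imageMB * nodes; // data the registry must serve
  return { totalMB, seconds: totalMB / registryMBps };
}

// Five fresh nodes pulling a 1.2GB image over a 20MB/s registry link:
console.log(pullCost({ imageMB: 1200, nodes: 5, registryMBps: 20 }));
// The same scale-up with a 180MB image:
console.log(pullCost({ imageMB: 180, nodes: 5, registryMBps: 20 }));
```

Even in this toy model, the slim image turns a five-minute pull into well under a minute, which is the difference between riding out a spike and missing it.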
Multi-Stage Builds: The Game Changer
The first thing I did was rewrite every Dockerfile using multi-stage builds. Here's our original Node.js Dockerfile:
FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["node", "dist/index.js"]
Simple, right? But this pulled in the full Node.js image (900MB), kept all the development dependencies, and included source files we didn't need in production. Our final image was 1.2GB.
Here's what we use now:
# Build stage
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Full install: the TypeScript build needs devDependencies
RUN npm ci && \
    npm cache clean --force
COPY . .
# Compile, then drop devDependencies before the production
# stage copies node_modules
RUN npm run build && \
    npm prune --omit=dev
# Production stage
FROM node:18-alpine AS production
WORKDIR /app
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package*.json ./
USER nodejs
EXPOSE 3000
CMD ["node", "dist/index.js"]
This dropped our image from 1.2GB to 180MB. Let me break down what made the difference:
Alpine base images: Switching from node:18 (900MB) to node:18-alpine (170MB) cut 730MB immediately. Alpine is a minimal Linux distribution designed for containers. We use Alpine for everything now except when we absolutely need glibc compatibility.
Separate build and production stages: The builder stage has all our build tools and source code. The production stage only copies the compiled artifacts and production dependencies. All the TypeScript source, dev dependencies, and build tools stay in the builder stage and never make it to the final image.
npm ci instead of npm install: This was a subtle but important change. npm ci is designed for CI/CD environments. It's faster, more reliable, and crucially, it only installs what's in package-lock.json. No surprises.
Cache cleaning: That npm cache clean --force removes npm's cache directory, saving another 50-100MB depending on your dependencies.
Here's the output from building this new Dockerfile:
$ docker build -t api:optimized .
[+] Building 127.3s (16/16) FINISHED
=> [builder 1/6] FROM node:18-alpine 3.2s
=> [builder 2/6] WORKDIR /app 0.1s
=> [builder 3/6] COPY package*.json ./ 0.1s
=> [builder 4/6] RUN npm ci 89.4s
=> [builder 5/6] COPY . . 0.3s
=> [builder 6/6] RUN npm run build 28.1s
=> [production 1/5] FROM node:18-alpine 0.0s
=> [production 2/5] RUN addgroup -g 1001 -S nodejs 0.4s
=> [production 3/5] COPY --from=builder /app/dist 0.2s
=> [production 4/5] COPY --from=builder /app/node_mod 0.8s
=> [production 5/5] COPY --from=builder /app/package* 0.1s
=> exporting to image 4.7s
=> => exporting layers 4.6s
=> => writing image sha256:a3f9b2... 0.0s
$ docker images | grep api
api optimized a3f9b2c8d1e4 2 minutes ago 180MB
api old f8e2a1b9c3d7 1 week ago 1.2GB
That 180MB versus 1.2GB difference transformed our deployment velocity. When we needed to scale up 20 pods during a traffic spike, we went from 8-12 minutes to under 90 seconds. The math is simple: 20 pods × 180MB = 3.6GB versus 20 pods × 1.2GB = 24GB of image data to transfer.
Layer Caching: The Secret to Fast Builds
But image size was only half the battle. Our CI/CD builds were taking forever—45 minutes for a simple code change. The problem was layer caching, or rather, our complete lack of understanding about how it worked.
Docker builds images in layers. Each instruction in your Dockerfile creates a new layer. Docker caches these layers and reuses them if nothing has changed. The key word is "if nothing has changed." If a layer changes, Docker invalidates the cache for that layer and all subsequent layers.
Here's where we were shooting ourselves in the foot:
# BAD: This invalidates cache on every build
FROM node:18-alpine
WORKDIR /app
COPY . . # Copies EVERYTHING, including package.json
RUN npm ci # Cache invalidated every time code changes
RUN npm run build
Every time we changed a single line of code, Docker had to reinstall all our dependencies because we'd copied the entire codebase before running npm ci. With 200+ npm packages, that was 3-4 minutes of unnecessary work on every build.
The fix is to order your Dockerfile instructions from least frequently changed to most frequently changed:
# GOOD: Cache-optimized layer ordering
FROM node:18-alpine
WORKDIR /app
# Copy only dependency files first
COPY package*.json ./
# Install dependencies (cached unless package.json changes)
RUN npm ci
# Copy source code last (changes frequently)
COPY . .
# Build (only runs if source changes)
RUN npm run build
Now when we change application code, Docker reuses the cached dependency layer. Our build times dropped from 45 minutes to 3-5 minutes for code changes. Only when we update dependencies does it take the full time.
Here's the build output showing cache hits:
$ docker build -t api:v2 .
[+] Building 4.2s (12/12) FINISHED
=> [1/6] FROM node:18-alpine CACHED
=> [2/6] WORKDIR /app CACHED
=> [3/6] COPY package*.json ./ CACHED
=> [4/6] RUN npm ci CACHED
=> [5/6] COPY . . 0.3s
=> [6/6] RUN npm run build 3.1s
See those "CACHED" markers? That's Docker reusing previous layers. Only the source copy and build steps ran, saving us 89 seconds of dependency installation.
The .dockerignore File You're Probably Missing
Another rookie mistake we made: we weren't using .dockerignore files. This meant every COPY . . instruction was copying our entire project directory, including node_modules, .git, test files, documentation, and local environment files.
This caused three problems:
- Slow context transfer: Docker has to send your entire build context to the Docker daemon before building. Our context was 400MB because it included node_modules and .git history. This took 15-20 seconds just to start the build.
- Cache invalidation: Copying unnecessary files meant our cache was invalidated by changes to files that didn't matter (like README updates or test file changes).
- Security risks: We were accidentally copying .env files, SSH keys, and other sensitive data into our images.
Here's the .dockerignore file we now use for every project:
# .dockerignore
node_modules
npm-debug.log
.git
.gitignore
.env
.env.*
*.md
LICENSE
.vscode
.idea
coverage
.nyc_output
dist
build
*.log
.DS_Store
Thumbs.db
*.swp
*.swo
*~
.pytest_cache
__pycache__
*.pyc
.coverage
htmlcov
.tox
.mypy_cache
After adding this, our build context dropped from 400MB to 12MB. Context transfer time went from 15-20 seconds to under 1 second. More importantly, we stopped invalidating cache when we updated documentation or test files.
Here's the before and after:
# Before .dockerignore
$ docker build -t api:v1 .
[+] Building 18.9s (2/2) FINISHED
=> [internal] load build context 18.4s
=> => transferring context: 421.34MB 18.3s
# After .dockerignore
$ docker build -t api:v2 .
[+] Building 1.1s (2/2) FINISHED
=> [internal] load build context 0.8s
=> => transferring context: 12.48MB 0.7s
That 18-second difference happens on every build. With 50+ builds per day across our team, we were wasting 15 minutes of build time daily just transferring unnecessary files.
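The pruning itself is easy to reason about. Here's a simplified sketch of how ignore entries shrink the context; real .dockerignore matching also supports Go-style glob patterns like *.md, which this toy version deliberately skips:

```javascript
// Simplified .dockerignore pruning: drop any file that matches an entry
// exactly or lives under an ignored directory. Globs are not handled here.
function pruneContext(files, ignoreEntries) {
  return files.filter(
    (f) => !ignoreEntries.some((p) => f === p || f.startsWith(p + '/'))
  );
}

const files = ['src/index.ts', 'node_modules/express/index.js', '.env', 'Dockerfile'];
console.log(pruneContext(files, ['node_modules', '.env']));
// → [ 'src/index.ts', 'Dockerfile' ]
```

The point is that the daemon never sees the pruned files at all, so they can neither bloat the transfer nor bust the cache.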
Security: The Production Reality Check
Three months after we launched, we got a security audit from a potential enterprise customer. They found 47 vulnerabilities in our container images. Not in our code—in our base images and dependencies. Some were critical. We almost lost the deal.
The problem was that we were using latest tags and never updating our base images. Our Dockerfiles looked like this:
FROM node:latest
That latest tag was actually pointing to an image that was 6 months old with known security vulnerabilities. We thought "latest" meant "most recent," but it actually means "whatever the image maintainer tagged as latest," which might not be updated frequently.
Base Image Selection and Versioning
Here's what we do now for every Dockerfile:
# Pin to specific version with SHA256 digest
FROM node:18.19.0-alpine3.19@sha256:435dcad253bb5b7f347ebc69c8cc52de7c912eb7241098b920f2fc2d7843183d AS builder
This pins not just to a specific version (18.19.0) but to the exact image digest. Even if someone compromises the registry and pushes a malicious image with the same tag, our builds will fail because the digest won't match.
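A small guard can enforce pinning in CI before a Dockerfile ever merges. This is a sketch of our own lint idea, not a standard tool; the regex is an assumption and may need adjusting for private registry naming:

```javascript
// Reject any FROM line that lacks a pinned sha256 digest.
const PINNED_FROM = /^FROM\s+\S+@sha256:[0-9a-f]{64}(\s|$)/;

function isPinned(fromLine) {
  return PINNED_FROM.test(fromLine.trim());
}

console.log(isPinned('FROM node:18-alpine')); // false - tag only
console.log(isPinned('FROM node:18.19.0-alpine3.19@sha256:' + 'ab'.repeat(32) + ' AS builder')); // true
```

Run it over every FROM line in the repo and fail the pipeline on a false, and unpinned base images can't sneak back in.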
We maintain a spreadsheet tracking base image versions across all our services. Every month, we review and update them. Here's our process:
- Check for security advisories on the base images we use
- Test updated images in staging
- Roll out updates service by service
- Document any breaking changes
We also scan our images with Trivy before pushing to production:
$ trivy image api:v2.1.0
api:v2.1.0 (alpine 3.19.0)
===========================
Total: 1 (UNKNOWN: 0, LOW: 1, MEDIUM: 0, HIGH: 0, CRITICAL: 0)
┌───────────────┬────────────────┬──────────┬───────────────────┬───────────────────┬────────────────────────────────┐
│ Library │ Vulnerability │ Severity │ Installed Version │ Fixed Version │ Title │
├───────────────┼────────────────┼──────────┼───────────────────┼───────────────────┼────────────────────────────────┤
│ libcrypto3 │ CVE-2024-0727 │ LOW │ 3.1.4-r0 │ 3.1.4-r1 │ openssl: denial of service via │
│ │ │ │ │ │ null dereference │
└───────────────┴────────────────┴──────────┴───────────────────┴───────────────────┴────────────────────────────────┘
A scan like this caught a critical vulnerability in a different image last month before it reached production. The fix was simple (update the base image to the patched version), but catching it early saved us from a potential security incident.
Running as Non-Root: The Principle of Least Privilege
By default, Docker containers run as root. This is a massive security risk. If an attacker compromises your application and breaks out of the container, they have root access to the host system.
We learned this the hard way when a penetration tester got shell access to one of our containers through a command injection vulnerability in a legacy API endpoint. Because the container was running as root, they were able to:
- Read sensitive environment variables
- Access mounted volumes with production data
- Make network requests to internal services
- Attempt to escape the container
Fortunately, they were a friendly pen tester, not a real attacker. But it scared us into fixing our security posture immediately.
Now every Dockerfile creates and uses a non-root user:
FROM node:18-alpine
# Create app directory
WORKDIR /app
# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
# Copy files with correct ownership
COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci --only=production
COPY --chown=nodejs:nodejs . .
# Switch to non-root user
USER nodejs
EXPOSE 3000
CMD ["node", "index.js"]
The key parts:
- addgroup and adduser: create a system group and user with a specific ID (1001). Using consistent IDs across containers helps with volume permissions.
- --chown flag: sets ownership of copied files to our non-root user. Without it, files are owned by root, and depending on their permissions the nodejs user may not be able to read or write them.
- USER directive: switches to the non-root user for all subsequent commands and the final container runtime.
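As one more line of defense, the application itself can refuse to start as root. This is a sketch of the idea rather than code from our codebase; process.getuid is only available on POSIX platforms:

```javascript
// Fail fast if the effective uid is root, e.g. because a Dockerfile
// lost its USER directive in a refactor.
function assertNotRoot(uid) {
  if (uid === 0) {
    throw new Error('Refusing to run as root; check the Dockerfile USER directive');
  }
  return uid;
}

// Wire-up at startup would look like:
//   assertNotRoot(process.getuid());
console.log(assertNotRoot(1001)); // 1001
```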
We also configure our orchestration layer to enforce non-root execution:
# Kubernetes security contexts (pod- and container-level)
securityContext:            # pod-level
  runAsNonRoot: true
  runAsUser: 1001
  runAsGroup: 1001
  fsGroup: 1001
containers:
  - name: api
    securityContext:        # container-level
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
          - ALL
This belt-and-suspenders approach ensures that even if someone forgets to add USER to a Dockerfile, Kubernetes will refuse to run the container as root.
Read-Only Filesystems and Volume Management
Another security practice we adopted: read-only root filesystems. Most applications don't need to write to the filesystem except for specific directories like /tmp or log directories.
Here's our pattern:
FROM node:18-alpine
WORKDIR /app
# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
# Create writable directories
RUN mkdir -p /app/logs /app/tmp && \
    chown -R nodejs:nodejs /app/logs /app/tmp
COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci --only=production
COPY --chown=nodejs:nodejs . .
USER nodejs
# Application writes logs here
VOLUME ["/app/logs"]
EXPOSE 3000
CMD ["node", "index.js"]
Then in our Kubernetes deployment:
spec:
  containers:
    - name: api
      image: api:v2.1.0
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: logs
          mountPath: /app/logs
        - name: tmp
          mountPath: /app/tmp
  volumes:
    - name: logs
      emptyDir: {}
    - name: tmp
      emptyDir: {}
This prevents attackers from modifying application files or installing malicious binaries even if they compromise the application. They can only write to the specific directories we've mounted as volumes.
We discovered this was important when we had a Redis container get compromised through an unpatched vulnerability. The attacker tried to download and execute a cryptocurrency miner. Because we had a read-only filesystem, the download failed. Our monitoring caught the failed attempts, and we patched the vulnerability before any real damage occurred.
Resource Management: The Scaling Nightmare
When we first hit 2M requests per day, our containers started behaving erratically. Some would consume 4GB of memory and get OOM killed. Others would max out CPU and slow to a crawl. We had no resource limits configured, so containers were competing for resources on the same node.
The worst incident happened at 3am on a Saturday. Our Python data processing service had a memory leak. Without resource limits, it consumed all 32GB of RAM on its node, causing the kernel OOM killer to start randomly killing processes. It killed our database proxy, which crashed our entire API layer. We were down for 45 minutes while I scrambled to restart everything.
Setting Realistic Resource Limits
Here's what we learned about resource limits the hard way:
Don't guess—measure first. We spent two weeks monitoring our containers in production with no limits, collecting metrics on actual resource usage. For each service, we tracked:
- P50, P95, and P99 memory usage
- P50, P95, and P99 CPU usage
- Memory usage during startup
- CPU usage during peak load
- Resource usage during background jobs
For our Node.js API, the data looked like this:
Memory Usage:
  P50: 180MB
  P95: 320MB
  P99: 450MB
  Startup: 280MB

CPU Usage:
  P50: 0.15 cores
  P95: 0.8 cores
  P99: 1.2 cores
  Peak: 1.5 cores
Based on this data, we set resource requests and limits:
resources:
  requests:
    memory: "512Mi"   # Just above P99 for normal operation
    cpu: "500m"       # Well above typical (P50) load; CPU is compressible
  limits:
    memory: "1Gi"     # 2x requests for headroom
    cpu: "2000m"      # Allows bursting during spikes
The key insight: requests are what you need, limits are your safety net. Requests guarantee resources. Limits prevent runaway processes from taking down the node.
We set memory requests just above P99 usage so containers have what they need under normal load, and memory limits at 2x requests to absorb spikes while preventing catastrophic resource exhaustion. CPU requests sit above typical load rather than peak, because CPU is compressible: a starved container slows down instead of dying.
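The memory half of that rule is mechanical enough to script. A sketch (the 128Mi rounding grid is our own convention, not a Kubernetes requirement):

```javascript
// Derive memory request/limit from a measured P99, per the rule above:
// request just above P99 (rounded up to a 128Mi grid), limit at 2x.
function sizeMemory(p99MiB) {
  const requestMiB = Math.ceil(p99MiB / 128) * 128;
  return { requestMiB, limitMiB: requestMiB * 2 };
}

// Our Node.js API's P99 of 450MB yields the 512Mi/1Gi pair shown above:
console.log(sizeMemory(450)); // { requestMiB: 512, limitMiB: 1024 }
```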
Here's what happened when we applied these limits:
# Before limits - OOM killed during traffic spike
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
api-7d8f9b5c6d-x2k9p 0/1 OOMKilled 5 10m
# After limits - graceful handling
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
api-7d8f9b5c6d-p8m3q 1/1 Running 0 2h
The container stays running, and Kubernetes can make intelligent scheduling decisions based on actual resource usage.
CPU Limits: The Throttling Trap
Here's something that surprised me: CPU limits can actually hurt performance even when you're not hitting the limit. This is because of how the Linux kernel implements CPU quotas.
When you set a CPU limit, the kernel enforces it using CFS (Completely Fair Scheduler) quotas. The quota is checked every 100ms. If your container uses more than its quota in that period, it gets throttled for the rest of the period—even if the CPU is idle.
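A toy model of the quota arithmetic makes the trap visible. The 100ms period and quota = limit × period are the CFS defaults; the burst numbers are illustrative:

```javascript
// CFS gives a container (limitCores x periodMs) of CPU time per period.
// A multi-threaded burst can exhaust that early and stall for the rest
// of the period even though average usage stays far below the limit.
function cfsPeriod({ limitCores, burstCores, burstMs, periodMs = 100 }) {
  const quotaMs = limitCores * periodMs; // CPU-ms allowed per period
  const usedMs = burstCores * burstMs;   // CPU-ms the burst consumes
  return { quotaMs, usedMs, throttled: usedMs >= quotaMs };
}

// Four threads running flat out for 50ms drain a 2-core quota entirely:
console.log(cfsPeriod({ limitCores: 2, burstCores: 4, burstMs: 50 }));
// { quotaMs: 200, usedMs: 200, throttled: true }
```

Averaged over the period, that container used only 2 cores' worth of time, yet it still spent half the window throttled.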
We discovered this when our API response times suddenly jumped from 50ms to 150ms after adding CPU limits. The containers weren't hitting their limits in aggregate, but they were experiencing micro-throttling.
Here's what the metrics showed:
Container: api-7d8f9b5c6d-p8m3q
CPU Usage: 850m (limit: 2000m)
Throttled Periods: 45,234
Total Periods: 120,000
Throttle Percentage: 37.7%
The container was being throttled nearly 38% of the time despite averaging less than half its CPU limit. This happened because our API's CPU usage was bursty: between the event loop, garbage collection, and libuv worker threads, the process would momentarily use well over 2 cores, exhaust the period's quota early in the 100ms window, and then sit throttled for the rest of it even though the CPU was otherwise idle.
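The throttle percentage above comes straight from the cgroup's cpu.stat counters. Here's a sketch of the parsing; the field names match cgroup v1 (cgroup v2 reports throttled_usec instead of throttled_time, but nr_periods and nr_throttled are the same):

```javascript
// Share of CFS periods in which the container was throttled, computed
// from the nr_periods / nr_throttled counters in the cgroup's cpu.stat.
function throttlePct(cpuStatText) {
  const field = (name) => {
    const m = cpuStatText.match(new RegExp(`${name} (\\d+)`));
    return m ? Number(m[1]) : 0;
  };
  const periods = field('nr_periods');
  return periods === 0 ? 0 : (100 * field('nr_throttled')) / periods;
}

const sample = 'nr_periods 120000\nnr_throttled 45234\nthrottled_time 9000000000';
console.log(throttlePct(sample).toFixed(1)); // "37.7"
```

Feeding this into your metrics pipeline is how you catch micro-throttling that average CPU graphs hide.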
The solution? We removed CPU limits entirely and only kept CPU requests:
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"       # Guarantees scheduling
  limits:
    memory: "1Gi"     # Prevents OOM
    # No CPU limit - allows bursting
This is controversial in the Kubernetes community, but it worked for us. Our response times dropped back to 50ms, and we haven't had CPU-related incidents. The key is to monitor CPU usage and set appropriate requests so Kubernetes schedules containers on nodes with sufficient capacity.
Memory Limits and the OOM Killer
Memory limits are different from CPU limits. You absolutely need them because running out of memory crashes your application. But setting them wrong is equally dangerous.
We learned this when our data processing service started getting OOM killed seemingly randomly. The logs showed:
[2024-01-15 14:23:45] Processing batch 1234...
[2024-01-15 14:23:47] Batch complete, processed 50000 records
[2024-01-15 14:23:48] Killed
No error message, no warning—just "Killed". The kernel OOM killer had terminated the process for exceeding its memory limit.
The problem was our limit was too low. We'd set it at 1GB based on average usage, but our batch processing had occasional spikes to 1.2GB when processing large datasets. The OOM killer doesn't care about averages—it kills processes the instant they exceed their limit.
We increased the limit to 2GB and added memory monitoring:
// Add to application code
const v8 = require('v8');
setInterval(() => {
  const heapStats = v8.getHeapStatistics();
  const usedMB = Math.round(heapStats.used_heap_size / 1024 / 1024);
  const limitMB = Math.round(heapStats.heap_size_limit / 1024 / 1024);
  console.log(`heap: ${usedMB}MB / ${limitMB}MB`);
  if (usedMB / limitMB > 0.85) {
    console.warn('Heap above 85% of its limit; investigate before the OOM killer does');
  }
}, 30000);
Now memory pressure shows up in our logs while there's still time to act, instead of as a bare "Killed" after the fact.