Aaron Vasquez
Last November, our team made the decision to rebuild our analytics dashboard as a serverless application on AWS Lambda. We were processing about 2 million API requests per month on a traditional EC2-based stack, and our infrastructure costs were climbing faster than our revenue. My CTO, Marcus, had been pushing for serverless for months, arguing we were paying for idle capacity 70% of the time. I was skeptical: I'd heard the horror stories about cold starts and vendor lock-in.
Three months later, we're handling 10 million requests per month, our infrastructure costs dropped by 60%, and I'm writing this guide because I wish someone had written it for me. But here's the thing: our first week in production was a disaster. We racked up $4,000 in unexpected Lambda costs because I made assumptions about how pricing worked. We had API endpoints timing out because I didn't understand execution context reuse. And we nearly lost a major client when our database connection pooling strategy fell apart under load.
This isn't going to be another tutorial that shows you how to deploy a "Hello World" Lambda function. You can find that in the AWS docs. Instead, I'm going to walk you through building a real production serverless application—the kind that handles actual user traffic, integrates with multiple AWS services, and needs to stay up when things go wrong. I'll show you the code we're actually running, the mistakes we made, and the hard-won lessons that aren't in any documentation.
Why We Chose Serverless (And Why You Might Not)
Before I dive into implementation details, let me be honest about the decision-making process. Serverless wasn't an obvious choice for us, and it might not be right for your use case either.
Our application is an analytics API that receives event data from client applications, processes it, stores it in DynamoDB, and serves aggregated reports through REST endpoints. Traffic is spiky—we get hit hard during business hours (9 AM to 6 PM EST) and see almost nothing at night. Our traditional EC2 setup meant we were paying for t3.large instances 24/7 to handle peak load, even though they sat at 5% CPU utilization for 16 hours a day.
I ran the numbers. Our monthly EC2 costs were around $850 (three t3.large instances for redundancy, plus load balancer). Our actual compute needs during peak hours suggested we needed maybe 2-3 hours of full capacity daily. Lambda's pricing model—pay per request and execution time—meant we'd only pay for what we actually used.
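To make that comparison concrete, here is a rough sketch of the Lambda side of the estimate. The two prices are assumptions (x86 list prices of $0.20 per million requests and $0.0000166667 per GB-second; check current pricing for your region), and the 200 ms duration and 512 MB memory size are borrowed from figures that appear later in this article. It ignores the free tier, API Gateway, DynamoDB, S3, and data transfer, so treat the result as a lower bound on compute only.

```javascript
// Rough Lambda compute cost estimate. Prices are assumptions, not quotes:
// check the current AWS pricing page for your region and architecture.
const PRICE_PER_MILLION_REQUESTS = 0.20;   // USD, assumed
const PRICE_PER_GB_SECOND = 0.0000166667;  // USD, assumed (x86)

function estimateMonthlyLambdaCost({ requests, avgDurationMs, memoryMb }) {
  // Lambda bills compute as (seconds of execution) x (GB of configured memory)
  const gbSeconds = requests * (avgDurationMs / 1000) * (memoryMb / 1024);
  const computeCost = gbSeconds * PRICE_PER_GB_SECOND;
  const requestCost = (requests / 1_000_000) * PRICE_PER_MILLION_REQUESTS;
  return { gbSeconds, computeCost, requestCost, total: computeCost + requestCost };
}

const estimate = estimateMonthlyLambdaCost({
  requests: 2_000_000,
  avgDurationMs: 200,
  memoryMb: 512,
});
console.log(`~$${estimate.total.toFixed(2)} per month for compute and requests`);
```

Under these assumptions the compute bill comes out at a few dollars a month, which is why duration and memory size matter so much: doubling either one doubles the GB-seconds.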
But here's what I didn't consider initially: Lambda has limitations that can bite you hard. Each function execution is limited to 15 minutes maximum. Memory ranges from 128 MB to 10 GB. You can't maintain persistent connections between invocations (at least not reliably). And if your application needs sub-10ms response times consistently, cold starts will kill you.
We spent two weeks prototyping before committing. I built a proof-of-concept version of our event ingestion endpoint and load-tested it with Apache Bench. Here's what that looked like:
ab -n 10000 -c 100 -p event.json -T application/json \
https://api.example.com/events
Output from the initial test:
Concurrency Level: 100
Time taken for tests: 45.234 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 2890000 bytes
Requests per second: 221.08 [#/sec] (mean)
Time per request: 452.340 [ms] (mean)
Time per request: 4.523 [ms] (mean, across all concurrent requests)
Those numbers were acceptable for our use case. Most importantly, failed requests were zero. But I noticed something concerning in CloudWatch metrics—about 15% of requests were taking 800-1200ms, while the rest were under 200ms. That was my first encounter with cold starts, and I didn't fully understand what I was seeing yet.
My colleague Sarah, who'd worked with Lambda at her previous company, warned me: "Cold starts are going to be your biggest pain point. You need a strategy from day one." She was right, but I didn't fully grasp it until we hit production.
The Architecture We Built (And How It Evolved)
Let me show you what our serverless architecture looks like now, after three months of iteration. This isn't what we started with—I'll explain the evolution as we go.
Our current setup consists of:
- API Gateway (REST API) as our entry point
- Six Lambda functions (three for ingestion, three for reporting)
- DynamoDB for data storage (two tables: events and aggregations)
- S3 for storing raw event data as backup
- CloudWatch for logging and monitoring
- EventBridge for scheduled aggregation jobs
- SQS for async event processing when we need guaranteed delivery
The request flow for our event ingestion endpoint looks like this:
- Client POSTs event data to API Gateway endpoint
- API Gateway triggers the ingest-event Lambda function
- Lambda validates the event, writes to DynamoDB, and sends raw data to S3
- Lambda returns a 202 Accepted response to the client
- DynamoDB Streams trigger the process-event Lambda for async processing
- Processed data goes into our aggregations table
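The last two steps of that flow can be sketched as follows. This is a minimal illustration, not the production process-event function: the tallying logic and the `countEventTypes` name are hypothetical, and it assumes the stream is configured to include new images so that `NewImage` is present on each record.

```javascript
// Sketch of a process-event handler fed by DynamoDB Streams.
// Only the record shape (Records[].eventName, Records[].dynamodb.NewImage)
// comes from the stream format; the aggregation itself is hypothetical.
function countEventTypes(streamEvent) {
  const counts = {};
  for (const record of streamEvent.Records ?? []) {
    // Only newly inserted events matter; skip MODIFY and REMOVE (e.g. TTL expiry)
    if (record.eventName !== 'INSERT') continue;
    // Stream images use DynamoDB's typed JSON, so strings live under ".S"
    const eventType = record.dynamodb?.NewImage?.eventType?.S;
    if (!eventType) continue;
    counts[eventType] = (counts[eventType] ?? 0) + 1;
  }
  return counts;
}

const handler = async (streamEvent) => {
  const counts = countEventTypes(streamEvent);
  // A real function would write these counts to the aggregations table here.
  console.log('Event counts for this batch:', JSON.stringify(counts));
  return counts;
};
// In the Lambda module this would be exported as: exports.handler = handler;
```

Keeping the counting logic in a plain function makes it easy to unit test with a hand-written batch, without deploying anything.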
Here's the critical lesson I learned: keep your Lambda functions small and focused. My first version tried to do everything in one function—validation, storage, processing, and aggregation. It was a 500-line monolith that took 3-4 seconds to execute and cost us a fortune. Breaking it into smaller functions reduced our average execution time to 200ms and cut costs by 70%.
Let me show you what the ingestion function looks like now:
// ingest-event/index.js
// AWS SDK v2. The nodejs18.x runtime only bundles SDK v3, so v2 has to ship
// in the deployment package.
const AWS = require('aws-sdk');

// Initialize clients outside the handler so warm invocations reuse them
const dynamodb = new AWS.DynamoDB.DocumentClient();
const s3 = new AWS.S3();

const TABLE_NAME = process.env.EVENTS_TABLE;
const BUCKET_NAME = process.env.RAW_EVENTS_BUCKET;
const TTL_SECONDS = 90 * 24 * 60 * 60; // 90 days

exports.handler = async (event) => {
  const startTime = Date.now();
  try {
    // Parse and validate the incoming event
    const body = JSON.parse(event.body);
    const validationError = validateEvent(body);
    if (validationError) {
      return {
        statusCode: 400,
        body: JSON.stringify({ error: validationError })
      };
    }

    // Generate unique ID and timestamp
    const eventId = generateEventId();
    const timestamp = Date.now();

    // Prepare DynamoDB item
    const item = {
      eventId,
      timestamp,
      userId: body.userId,
      eventType: body.eventType,
      properties: body.properties,
      // DynamoDB TTL expects epoch seconds, while Date.now() returns milliseconds
      ttl: Math.floor(timestamp / 1000) + TTL_SECONDS
    };

    // Write to DynamoDB and S3 in parallel
    await Promise.all([
      dynamodb.put({
        TableName: TABLE_NAME,
        Item: item
      }).promise(),
      s3.putObject({
        Bucket: BUCKET_NAME,
        Key: `events/${new Date().toISOString().split('T')[0]}/${eventId}.json`,
        Body: JSON.stringify(body),
        ContentType: 'application/json'
      }).promise()
    ]);

    const duration = Date.now() - startTime;
    console.log(`Event processed in ${duration}ms`);

    return {
      statusCode: 202,
      body: JSON.stringify({
        eventId,
        message: 'Event accepted for processing'
      })
    };
  } catch (error) {
    console.error('Error processing event:', error);
    return {
      statusCode: 500,
      body: JSON.stringify({
        error: 'Internal server error',
        // Optional chaining keeps the error path from throwing on non-API Gateway events
        requestId: event.requestContext?.requestId
      })
    };
  }
};

function validateEvent(body) {
  if (!body.userId) return 'userId is required';
  if (!body.eventType) return 'eventType is required';
  if (!body.properties || typeof body.properties !== 'object') {
    return 'properties must be an object';
  }
  return null;
}

function generateEventId() {
  // slice(2, 11) takes 9 random base-36 characters (substr is deprecated)
  return `evt_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
}
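One gap worth pointing out in the handler above: `JSON.parse` sits inside the `try` block, so a malformed body (or a literal `null`) lands in the `catch` and comes back as a 500 even though it is the client's mistake. A minimal sketch of one way to handle that, using a hypothetical `parseBody` helper that is not part of the original function:

```javascript
// Hypothetical helper: turn a bad request body into a 400 instead of a 500.
function parseBody(event) {
  if (!event || typeof event.body !== 'string') {
    return { error: 'Request body is required' };
  }
  // API Gateway sets isBase64Encoded when it base64-encodes the payload
  const raw = event.isBase64Encoded
    ? Buffer.from(event.body, 'base64').toString('utf8')
    : event.body;
  try {
    const body = JSON.parse(raw);
    // validateEvent reads properties off the body, so insist on a plain object
    if (body === null || typeof body !== 'object' || Array.isArray(body)) {
      return { error: 'Request body must be a JSON object' };
    }
    return { body };
  } catch (err) {
    return { error: 'Request body is not valid JSON' };
  }
}
```

In the handler, this would replace the bare `JSON.parse(event.body)` call: destructure `{ body, error }` from `parseBody(event)` and return a 400 with the error message when `error` is set.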
When I deploy this function and test it, here's what the CloudWatch logs look like:
START RequestId: a1b2c3d4-e5f6-7890-abcd-ef1234567890 Version: $LATEST
Event processed in 145ms
END RequestId: a1b2c3d4-e5f6-7890-abcd-ef1234567890
REPORT RequestId: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Duration: 147.82 ms Billed Duration: 148 ms Memory Size: 512 MB Max Memory Used: 78 MB
Init Duration: 312.45 ms
Notice that Init Duration? That's the cold start—312ms to initialize the execution environment. For warm invocations (when Lambda reuses an existing environment), that Init Duration disappears and we're down to just the 148ms execution time.
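The same distinction can be observed from inside the function. Code at module scope runs once per execution environment, so a module-level flag tells you whether the current invocation is the first one that environment has served. A small sketch (the `consumeColdStartFlag` name is made up for illustration):

```javascript
// Module scope runs once per execution environment, so this flag is
// true only for the first invocation that environment serves.
let isColdStart = true;

function consumeColdStartFlag() {
  const wasCold = isColdStart;
  isColdStart = false;
  return wasCold;
}

const handler = async (event) => {
  const coldStart = consumeColdStartFlag();
  // Emit a structured log line that can be filtered on later
  console.log(JSON.stringify({ coldStart }));
  return { statusCode: 200, body: JSON.stringify({ coldStart }) };
};
// In the Lambda module this would be exported as: exports.handler = handler;
```

This is the same mechanism that makes client reuse work: anything created outside the handler survives between warm invocations.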
The Cold Start Problem (And Three Solutions That Actually Work)
Cold starts nearly killed our production launch. Here's what happened: we deployed on a Friday afternoon (mistake #1—never deploy on Friday). Traffic was low initially, so everything looked fine. Monday morning at 9 AM, we got slammed with 500 requests in the first minute. About 30% of those requests timed out.
I pulled up CloudWatch Insights and ran this query to see what was going on:
fields @timestamp, @duration, @initDuration
| filter @type = "REPORT"
| stats avg(@duration), avg(@initDuration), max(@duration), max(@initDuration) by bin(5m)
The results were brutal:
Average Duration: 156ms
Average Init Duration: 1,247ms (when present)
Max Duration: 2,341ms
Max Init Duration: 3,892ms
Percentage of requests with cold starts: 28%
Nearly a third of our requests were experiencing cold starts over a second long. Our API Gateway timeout was set to 3 seconds, so we were barely squeaking by. But some requests were hitting that 3.8-second initialization and timing out completely.
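The query above returns averages and maxima, not the share of requests that cold started. One way to get at that share is to count REPORT lines that carry an `Init Duration` field, since Lambda only includes it on cold starts. The following is a sketch that assumes each REPORT entry arrives as a single log line; the `summarizeReports` function is illustrative, not something from our codebase.

```javascript
// Sketch: derive the cold start share from Lambda REPORT log lines.
// A REPORT line only contains "Init Duration" when that invocation cold started.
function summarizeReports(lines) {
  let total = 0;
  let coldStarts = 0;
  let initSumMs = 0;
  for (const line of lines) {
    if (!line.startsWith('REPORT')) continue;
    total += 1;
    const match = line.match(/Init Duration: ([\d.]+) ms/);
    if (match) {
      coldStarts += 1;
      initSumMs += parseFloat(match[1]);
    }
  }
  return {
    total,
    coldStarts,
    coldStartPercent: total ? (coldStarts / total) * 100 : 0,
    avgInitMs: coldStarts ? initSumMs / coldStarts : 0,
  };
}
```

Note that the average init duration is taken over cold starts only, which matches the "(when present)" caveat in the results above.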
I tried three different approaches to solve this, and I'll tell you which ones actually worked in production.
Solution 1: Provisioned Concurrency (Expensive But Effective)
Provisioned Concurrency keeps Lambda execution environments warm and ready to respond immediately. You specify how many concurrent executions to keep initialized, and AWS maintains that pool for you.
I configured it through the AWS Console initially, then moved to Infrastructure as Code with the AWS CDK:
// cdk/lambda-stack.js
// Fragment from inside a Stack constructor (hence `this`); eventsTable and
// rawEventsBucket are defined elsewhere in the stack.
const lambda = require('@aws-cdk/aws-lambda');
const { Duration } = require('@aws-cdk/core');

const ingestFunction = new lambda.Function(this, 'IngestEvent', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda/ingest-event'),
  memorySize: 512,
  timeout: Duration.seconds(5),
  environment: {
    EVENTS_TABLE: eventsTable.tableName,
    RAW_EVENTS_BUCKET: rawEventsBucket.bucketName
  }
});

// Add provisioned concurrency (it attaches to a version or alias, not $LATEST)
const version = ingestFunction.currentVersion;
const alias = new lambda.Alias(this, 'IngestEventAlias', {
  aliasName: 'production',
  version: version,
  provisionedConcurrentExecutions: 5
});
The impact was immediate. Cold start percentage dropped from 28% to less than 1%. But here's the catch: Provisioned Concurrency is expensive. I was paying $0.015 per GB-hour for 5 concurrent executions with 512 MB memory. That's about $27 per month just to keep functions warm, on top of the actual execution costs.
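The $27 figure is easy to reproduce. In the sketch below, the $0.015 per GB-hour rate is the one quoted above, and 730 is an assumed average number of hours in a month; the function name is made up for illustration.

```javascript
// Reproduces the provisioned concurrency arithmetic. The rate is the figure
// quoted in this article; 730 hours per month is an assumed average.
function provisionedConcurrencyMonthlyCost({
  instances,
  memoryMb,
  pricePerGbHour,
  hoursPerMonth = 730,
}) {
  // 5 environments x 0.5 GB each = 2.5 GB kept warm around the clock
  const provisionedGb = instances * (memoryMb / 1024);
  return provisionedGb * hoursPerMonth * pricePerGbHour;
}

const monthly = provisionedConcurrencyMonthlyCost({
  instances: 5,
  memoryMb: 512,
  pricePerGbHour: 0.015,
});
console.log(`Roughly $${Math.round(monthly)} per month to keep 5 environments warm`);
```

The cost scales linearly with both the number of environments and the memory size, so it is worth right-sizing memory before turning this on.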
For our high-traffic ingestion endpoint, it was worth it. For our reporting endpoints that get hit maybe 100 times per day? Absolutely not worth it.
Solution 2: Scheduled Warming (Clever But Fragile)
My second approach was to use EventBridge to ping our Lambda functions every 5 minutes, keeping them warm without paying for Provisioned Concurrency. I set up a scheduled rule:
// cdk/warming-stack.js
// Fragment from inside a Stack constructor; ingestFunction is the Lambda
// function defined in the previous snippet.
const events = require('@aws-cdk/aws-events');
const targets = require('@aws-cdk/aws-events-targets');
const { Duration } = require('@aws-cdk/core');

const warmingRule = new events.Rule(this, 'WarmingRule', {
  schedule: events.Schedule.rate(Duration.minutes(5)),
  description: 'Keep Lambda functions warm'
});

warmingRule.addTarget(new targets.LambdaFunction(ingestFunction, {
  event: events.RuleTargetInput.fromObject({
    source: 'warming',
    action: 'ping'
  })
}));
In my Lambda function, I added logic to detect and handle warming requests:
exports.handler = async (event) => {
  // Handle warming pings
  if (event.source === 'warming' && event.action === 'ping') {
    console.log('Warming ping received');
    return { statusCode: 200, body: 'warm' };
  }

  // Regular processing logic continues...
};
This reduced cold starts to about 8%, which was acceptable for our lower-traffic endpoints. But it's fragile—if traffic suddenly spikes and you need more than one concurrent execution, those additional invocations will still cold start. And you're paying for those warming invocations (though they're cheap since they do almost nothing).
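If several functions need the same ping handling, the check can be pulled into a small wrapper instead of being copied into each handler. This is a sketch of that idea rather than what we run; `isWarmingPing` and `withWarmup` are hypothetical names, and the payload shape matches the warming rule above.

```javascript
// Hypothetical wrapper: short-circuits warming pings before business logic runs.
// The { source: 'warming', action: 'ping' } payload matches the rule above.
function isWarmingPing(event) {
  return Boolean(event) && event.source === 'warming' && event.action === 'ping';
}

function withWarmup(handler) {
  return async (event, context) => {
    if (isWarmingPing(event)) {
      // Return early so the ping is billed for as little time as possible
      return { statusCode: 200, body: 'warm' };
    }
    return handler(event, context);
  };
}

// Usage sketch: wrap the real handler once, at export time.
const handler = withWarmup(async (event) => {
  return { statusCode: 202, body: JSON.stringify({ message: 'Event accepted for processing' }) };
});
// In the Lambda module this would be exported as: exports.handler = handler;
```

The wrapper does nothing about the underlying limitation: a single scheduled ping still keeps only one execution environment warm.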
Solution 3: Optimize Initialization (The Real Solution)
The approach that actually solved our cold start problem long-term was optimizing what happens during initialization. I profiled our Lambda function and discovered we were doing a ton of unnecessary work during cold starts.
Here's what I found:
- Loading the entire AWS SDK (all services) added 400ms
- Establishing database connections during initialization added 300ms
- Loading a large JSON configuration file added 150ms
I refactored to only load what we needed:
// Before: Loading entire AWS SDK
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();
const s3 = new AWS.S3();

// After: Loading only specific clients
const { DynamoDB } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocument } = require('@aws-sdk/lib-dynamodb');
const { S3 } = require('@aws-sdk/client-s3');

// Initialize clients lazily
let dynamoClient;
let s3Client;

function getDynamoClient() {
  if (!dynamoClient) {
    const client = new DynamoDB({});
    dynamoClient = DynamoDBDocument.from(client);
  }
  return dynamoClient;
}

function getS3Client() {
  if (!s3Client) {
    s3Client = new S3({});
  }
  return s3Client;
}

exports.handler = async (event) => {
  const dynamo = getDynamoClient();
  const s3 = getS3Client();
  // Rest of handler logic...
};
I also moved configuration loading to happen lazily on first request:
let config;

function getConfig() {
  if (!config) {
    config = JSON.parse(process.env.CONFIG_JSON);
  }
  return config;
}
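The client getters and `getConfig` all follow the same pattern, so it can be factored into one small helper. The `lazy` function below is an illustrative sketch, not part of our codebase; the `'{}'` fallback for a missing `CONFIG_JSON` is also an assumption added so the snippet runs anywhere.

```javascript
// Generic lazy initializer: runs the factory on first use, then caches the
// result for the lifetime of the execution environment.
function lazy(factory) {
  let initialized = false;
  let value;
  return () => {
    if (!initialized) {
      // If the factory throws, nothing is cached and the next call retries
      value = factory();
      initialized = true;
    }
    return value;
  };
}

// Hypothetical usage; CONFIG_JSON is the env var from the snippet above.
const getConfig = lazy(() => JSON.parse(process.env.CONFIG_JSON ?? '{}'));
```

Using an explicit `initialized` flag rather than checking the value itself means a factory that legitimately returns `null` or `0` is still only run once.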
These optimizations reduced our cold start time from 1,247ms average to 412ms average—a 67% improvement. Combined with scheduled warming for our medium-traffic endpoints and Provisioned Concurrency for our highest-traffic endpoint, we got cold starts under control.
About the author: Aaron Vasquez covers DevOps practices, CI/CD pipelines, Kubernetes, and platform engineering. Contributing author at NextGenBeing.