Bekzod Erkinov
Three years ago, my team was running Jenkins on a single EC2 instance. We thought we were doing great: builds took 8-12 minutes, deployments happened twice a week, and everything "just worked." Then we hit 50 developers and suddenly our CI/CD became the bottleneck everyone complained about in standups.
What followed was an 18-month journey through three different CI/CD platforms. We migrated from Jenkins to Travis CI, got frustrated with Travis's limitations at scale, then moved to CircleCI, and eventually... well, I'll get to that. Each migration taught us something the documentation never mentioned, cost us more than we budgeted, and revealed trade-offs that only appear when you're running 200+ builds per day.
This isn't another feature comparison chart. I'm going to share what actually happened when we scaled each platform, the specific problems we hit, the costs we didn't anticipate, and the honest reasons we made each decision. If you're evaluating CI/CD tools for a team larger than 10 developers, here's what I wish someone had told me before we started.
The Jenkins Years: When Self-Hosting Seemed Like a Good Idea
We started with Jenkins because, honestly, everyone starts with Jenkins. It's free, it's flexible, and our senior DevOps engineer Jake had been using it for a decade. "I can set this up in an afternoon," he said. That was technically true—the basic setup took 4 hours. Making it production-ready took 3 months.
Our initial Jenkins setup ran on a t3.xlarge EC2 instance (4 vCPU, 16GB RAM). We had 12 developers, maybe 30-40 builds per day, and everything felt fine. Our Jenkinsfile looked pretty standard:
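pipeline {
    agent any

    stages {
        stage('Build') {
            steps {
                sh 'npm ci'
                sh 'npm run build'
            }
        }
        stage('Test') {
            steps {
                sh 'npm test -- --coverage'
            }
        }
        stage('Deploy to Staging') {
            // Only deploy builds of the develop branch
            when {
                branch 'develop'
            }
            steps {
                sh './deploy-staging.sh'
            }
        }
    }

    post {
        always {
            // Publish test results and the HTML coverage report, even on failure
            junit 'test-results/**/*.xml'
            publishHTML([
                reportName: 'Coverage Report',
                reportDir: 'coverage',
                reportFiles: 'index.html',
                allowMissing: false,
                alwaysLinkToLastBuild: true,
                keepAll: true
            ])
        }
    }
}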
This worked beautifully until it didn't. The problems started appearing around month 4 when we grew to 25 developers. Suddenly builds were queuing. A simple PR that should've taken 6 minutes was waiting 15 minutes just to start. Our Jenkins master was maxing out CPU during peak hours (10am-2pm), and developers were getting increasingly frustrated.
The Hidden Complexity of Jenkins at Scale
Here's what nobody tells you about Jenkins: the initial setup is easy, but making it scale is an entirely different beast. We needed to add build agents, which meant:
- Setting up agent nodes: We provisioned 3 additional t3.large instances as permanent agents
- Configuring SSH credentials: Each agent needed SSH keys, proper user permissions, and security group rules
- Installing dependencies on every agent: Node.js, Docker, AWS CLI, kubectl—the list grew to 15+ tools
- Managing agent labels: Different projects needed different agent configurations
- Monitoring agent health: Agents would randomly disconnect and we wouldn't notice until builds failed
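For the last point, a small script in the Jenkins script console can at least list which agents are currently offline. This is a minimal sketch, not the monitoring we ran; it assumes you have admin access to Manage Jenkins > Script Console and only prints to the console rather than alerting anyone.

```groovy
// Run from Manage Jenkins > Script Console.
// Prints every agent that Jenkins currently considers offline.
import jenkins.model.Jenkins

def offline = Jenkins.get().computers.findAll { it.offline }

offline.each { computer ->
    // offlineCauseReason is empty when Jenkins has no recorded cause
    println "${computer.displayName} is offline: ${computer.offlineCauseReason}"
}
println "${offline.size()} agent(s) offline"
```

Running something like this on a schedule and forwarding the output to chat is one way to notice a disconnected agent before a build does.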
Our infrastructure costs jumped from $180/month (single master) to $720/month (master + 3 permanent agents). But the real cost was maintenance time. Jake was spending 10-15 hours per week just keeping Jenkins running. Plugin updates broke things. Agents went offline mysteriously. The Jenkins master ran out of disk space twice because we didn't configure log rotation properly.
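For the disk-space problem specifically, one common guard is to cap how much build history each job keeps. The sketch below uses the declarative `buildDiscarder` option; the retention numbers are placeholders chosen for illustration, and this trims build records and artifacts only, not the Jenkins system log.

```groovy
// Jenkinsfile sketch: limit retained builds and artifacts per job.
// The retention values are illustrative assumptions.
pipeline {
    agent any
    options {
        buildDiscarder(logRotator(
            numToKeepStr: '30',          // keep at most 30 builds
            daysToKeepStr: '14',         // and nothing older than 14 days
            artifactNumToKeepStr: '5'    // keep artifacts for the last 5 builds only
        ))
    }
    stages {
        stage('Build') {
            steps {
                sh 'npm ci'
                sh 'npm run build'
            }
        }
    }
}
```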
The agent management became particularly painful when we realized different teams needed different build environments. Our frontend team needed Node.js 18, but our legacy API still required Node.js 14. Our data pipeline team needed Python 3.9 with specific scientific computing libraries. Our mobile team needed Android SDK and Gradle. We ended up creating specialized agents with labels like nodejs-18, python-data, and android-build, which meant maintaining separate AMIs for each agent type.
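In a Jenkinsfile, routing work to those specialized agents comes down to the `label` on each stage's `agent` block. A minimal sketch follows; the labels are the ones mentioned above, while the stage names and build commands are assumptions for illustration.

```groovy
// Jenkinsfile sketch: each stage requests an agent by label.
// Stage names and commands are hypothetical.
pipeline {
    agent none  // no default agent; every stage picks its own

    stages {
        stage('Frontend build') {
            agent { label 'nodejs-18' }
            steps {
                sh 'node --version'
                sh 'npm ci && npm run build'
            }
        }
        stage('Data pipeline tests') {
            agent { label 'python-data' }
            steps {
                sh 'python3 --version'
                sh 'python3 -m pytest'
            }
        }
        stage('Android build') {
            agent { label 'android-build' }
            steps {
                sh './gradlew assembleDebug'
            }
        }
    }
}
```

The flip side is exactly the maintenance burden described here: every label in a Jenkinsfile is a promise that some agent image with that toolchain exists and is up to date.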
The real nightmare was keeping these agents synchronized. When we needed to update a shared dependency like Docker, we had to update it across all agents. We tried using Ansible for this, but that meant maintaining Ansible playbooks on top of everything else. One time, we updated Docker on three agents but forgot the fourth, and spent two hours debugging why builds were failing inconsistently—turns out the old Docker version had a bug that the newer version fixed, but only some builds hit the problematic agent.
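A drift check like the one we were missing can itself be a small pipeline job. The following scripted-pipeline sketch asks each agent for its Docker version and fails if they disagree. The agent names are hypothetical, and it assumes the agents are online (a `node` block waits indefinitely for an offline agent, so wrapping the job in a `timeout` would be sensible).

```groovy
// Scripted pipeline sketch: fail when agents disagree on the Docker version.
// Agent names are hypothetical; replace them with your own node names.
def agentNames = ['agent-1', 'agent-2', 'agent-3', 'agent-4']
def versions = [:]

for (name in agentNames) {
    node(name) {
        // Ask the Docker daemon on this agent for its version string
        versions[name] = sh(
            script: "docker version --format '{{.Server.Version}}'",
            returnStdout: true
        ).trim()
    }
}

// Collect the distinct versions reported across all agents
def distinct = []
for (name in agentNames) {
    if (!distinct.contains(versions[name])) {
        distinct.add(versions[name])
    }
}

if (distinct.size() > 1) {
    error("Docker version drift detected: ${versions}")
}
echo "All ${agentNames.size()} agents run Docker ${distinct[0]}"
```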
I remember one particularly bad Friday. We had a critical hotfix that needed to deploy, but all our agents were stuck in "disconnected" state. Jake spent 2 hours SSH-ing into each agent, restarting the Jenkins agent service, and manually triggering builds. The hotfix that should've taken 20 minutes took 3 hours. Our CTO Sarah was not happy.
The post-mortem on that incident revealed the root cause: the Jenkins master had run out of file descriptors because we had never raised the default open-file limit for the Jenkins user. The fix was a single line in /etc/security/limits.conf, but finding it took digging through Jenkins logs, system logs, and eventually Stack Overflow. These are the kinds of operational details that don't appear in the "Getting Started with Jenkins" tutorials.
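One cheap way to make this kind of limit visible is to print it from a build, since shell steps inherit their limits from the Jenkins process that launches them. The sketch below is an assumption about how you might surface it, not the fix itself. Entries in limits.conf follow the pattern `<domain> <type> <item> <value>`, with `nofile` as the item for open files.

```groovy
// Jenkinsfile sketch: print the open-file limits that builds inherit.
// To inspect the controller, run it on an executor that lives there.
pipeline {
    agent any
    stages {
        stage('Check open-file limit') {
            steps {
                sh '''
                    echo "Soft nofile limit: $(ulimit -Sn)"
                    echo "Hard nofile limit: $(ulimit -Hn)"
                '''
            }
        }
    }
}
```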
The Plugin Ecosystem: Both Blessing and Curse
The Jenkins plugin ecosystem is massive—over 1,800 plugins available. This sounds great until you realize that maintaining a Jenkins instance means managing plugin dependencies, compatibility, and updates. We learned this the hard way when we updated the Git plugin and it broke our webhook integrations. Builds stopped triggering automatically, and we didn't notice for six hours because the Jenkins UI showed no errors.
Some plugins we couldn't live without:
Blue Ocean - Made the Jenkins UI actually usable. The classic Jenkins UI looks like it's from 2005 (because it basically is). Blue Ocean provided a modern interface that our developers didn't hate. But it was a separate plugin that sometimes got out of sync with the main Jenkins version.
Pipeline: AWS Steps - Gave us native AWS integration for deployments. We could update ECS services, upload to S3, and invalidate CloudFront distributions directly from pipeline steps. The alternative was writing bash scripts that wrapped the AWS CLI, which was error-prone and harder to maintain.
Configuration as Code (JCasC) - This plugin was a lifesaver for managing Jenkins configuration in version control. Instead of clicking through the UI to configure jobs, we could define everything in YAML:
jenkins:
  systemMessage: "Production Jenkins - Handle with Care"
  numExecutors: 2
  securityRealm:
    ldap:
      configurations:
        - server: "ldap.company.com"
          rootDN: "dc=company,dc=com"
          userSearchBase: "ou=users"
  authorizationStrategy:
    projectMatrix:
      permissions:
        - "Overall/Read:authenticated"
        - "Job/Build:developers"
        - "Job/Cancel:developers"
credentials:
  system:
    domainCredentials:
      - credentials:
          - usernamePassword:
              scope: GLOBAL
              id: "github-token"
              username: "jenkins-bot"
              password: "${GITHUB_TOKEN}"
          - aws:
              scope: GLOBAL
              id: "aws-credentials"
              accessKey: "${AWS_ACCESS_KEY}"
              secretKey: "${AWS_SECRET_KEY}"
jobs:
  - script: >
      folder('frontend') {
        description('Frontend application pipelines')
      }
  - script: >
      pipelineJob('frontend/main-branch') {
        definition {
          cpsScm {
            scm {
              git {
                remote {
                  url('https://github.com/company/frontend.git')
                  credentials('github-token')
                }
                branch('main')
              }
            }
            scriptPath('Jenkinsfile')
          }
        }
      }
But even JCasC had limitations. Some plugins didn't support it, which meant we had a hybrid approach—some configuration in YAML, some in the UI. Keeping track of what was configured where became its own challenge.
Slack Notification Plugin - Essential for keeping the team informed. We configured it to send notifications for failed builds, successful deployments, and builds that took longer than expected. But setting up the notification templates required learning yet another DSL:
post {
    failure {
        slackSend(
            color: 'danger',
            message: """
                *Build Failed* :x:
                Job: ${env.JOB_NAME}
                Build: ${env.BUILD_NUMBER}
                Author: ${env.CHANGE_AUTHOR}
                Branch: ${env.GIT_BRANCH}
                Duration: ${currentBuild.durationString}
            """,
            channel: '#deployments'
        )
    }
    success {
        slackSend(
            color: 'good',
            message: """
                *Deployment Successful* :white_check_mark:
                Job: ${env.JOB_NAME}
                Environment: ${env.DEPLOY_ENV}
                Version: ${env.GIT_COMMIT.take(7)}
            """,
            channel: '#deployments'
        )
    }
}
The plugin update process was anxiety-inducing. Jenkins would show a notification that 23 plugins had updates available. But updating them was risky—we had to read the changelog for each plugin to check for breaking changes, update in a staging Jenkins instance first, test thoroughly, then update production during a maintenance window. We scheduled these updates monthly, and they took 2-3 hours each time.
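Before each of those maintenance windows it helps to have the list of pending updates as text rather than as a badge in the UI. A minimal script-console sketch, assuming admin access and a reachable update site; it only reports, it does not install anything.

```groovy
// Run from Manage Jenkins > Script Console.
// Lists installed plugins that have a newer version on the update site.
import jenkins.model.Jenkins

def outdated = Jenkins.get().pluginManager.plugins.findAll { it.hasUpdate() }

outdated.sort { it.shortName }.each { plugin ->
    println "${plugin.shortName}: ${plugin.version} -> ${plugin.updateInfo?.version}"
}
println "${outdated.size()} plugin(s) have updates available"
```

The output gives you a checklist of changelogs to read and a record of what changed if something breaks afterwards.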
What Jenkins Does Really Well
Despite the operational headaches, Jenkins has some genuinely powerful features that we came to appreciate:
Pipeline flexibility is unmatched. You can do literally anything in a Jenkins pipeline. Need to query a database during your build? Sure. Want to trigger builds based on custom webhook payloads? No problem. Need to integrate with that obscure internal tool your company built in 2010? Jenkins can do it.
We built some pretty complex pipelines that I don't think we could've replicated easily elsewhere. One example was our multi-stage deployment pipeline that:
- Built Docker images for 5 microservices
- Ran integration tests against a dynamically provisioned test environment
- Deployed to staging with blue-green deployment
- Ran smoke tests
- Required manual approval from QA
- Deployed to production across 3 regions sequentially
- Ran post-deployment health checks
- Automatically rolled back if health checks failed
This pipeline had 47 stages and took 35 minutes to complete successfully. It was beautiful in its complexity. Here's a simplified version of the deployment stage:
stage('Deploy to Production') {
    steps {
        script {
            def regions = ['us-east-1', 'eu-west-1', 'ap-southeast-1']
            for (region in regions) {
                echo "Deploying to ${region}"

                // Deploy new version
                sh """
                    aws ecs update-service \
                        --cluster production-${region} \
                        --service api-service \
                        --force-new-deployment \
                        --region ${region}
                """

                // Wait for deployment to stabilize
                timeout(time: 10, unit: 'MINUTES') {
                    sh """
                        aws ecs wait services-stable \
                            --cluster production-${region} \
                            --services api-service \
                            --region ${region}
                    """
                }

                // Run health checks
                def healthCheckPassed = sh(
                    script: "./health-check.sh ${region}",
                    returnStatus: true
                ) == 0
                if (!healthCheckPassed) {
                    error("Health check failed in ${region}. Rolling back.")
                }

                echo "Successfully deployed to ${region}"
                sleep(time: 2, unit: 'MINUTES') // Stagger deployments
            }
        }
    }
}
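The manual QA approval step in that list maps onto the built-in `input` step. A minimal sketch of such a gate follows; the `qa-team` submitter group and the four-hour timeout are assumptions for illustration, not the values from our pipeline.

```groovy
// Jenkinsfile sketch: pause the pipeline until QA approves the release.
// 'qa-team' is a hypothetical user or group ID; the timeout is arbitrary.
stage('QA Approval') {
    steps {
        timeout(time: 4, unit: 'HOURS') {
            input message: 'Promote this build to production?',
                  ok: 'Deploy',
                  submitter: 'qa-team'
        }
    }
}
```

The timeout matters: without it, a forgotten approval keeps the build (and, with a pipeline-level agent, an executor) waiting indefinitely.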
We also built a sophisticated pipeline for our database migrations that showcased Jenkins's scripting power:
stage('Database Migration') {
    steps {
        script {
            // First, create a backup
            def backupId = sh(
                script: """
                    aws rds create-db-snapshot \
                        --db-instance-identifier production-db \
                        --db-snapshot-identifier migration-backup-${BUILD_NUMBER} \
                        --query 'DBSnapshot.DBSnapshotIdentifier' \
                        --output text
                """,
                returnStdout: true
            ).trim()
            echo "Created backup: ${backupId}"

            // Wait for backup to complete
            timeout(time: 20, unit: 'MINUTES') {
                waitUntil {
                    def status = sh(
                        script: """
                            aws rds describe-db-snapshots \
                                --db-snapshot-identifier ${backupId} \
                                --query 'DBSnapshots[0].Status' \
                                --output text
                        """,
                        returnStdout: true
                    ).trim()
                    return status == 'available'
                }
            }
            echo "Backup completed successfully"

            // Run migrations
            try {
                // Single-quoted script: nothing is interpolated by Groovy.
                // DB_PASSWORD reaches the shell through the build environment,
                // so the secret never appears in the generated script.
                sh '''
                    export DB_HOST=production-db.cluster-xyz.us-east-1.rds.amazonaws.com
                    export DB_NAME=production
                    export DB_USER=migration_user
                    npm run migrate:up
                '''
                echo "Migrations completed successfully"

                // Verify data integrity
                def verificationPassed = sh(
                    script: "./verify-migration.sh",
                    returnStatus: true
                ) == 0
                if (!verificationPassed) {
                    error("Migration verification failed!")
                }
            } catch (Exception e) {
                echo "Migration failed! Rolling back..."
                // Restore from backup
                sh """
                    aws rds restore-db-instance-from-db-snapshot \
                        --db-instance-identifier production-db-restore-${BUILD_NUMBER} \
                        --db-snapshot-identifier ${backupId}
                """
                error("Migration failed and rollback initiated: ${e.message}")
            }
        }
    }
}
This level of control and conditional logic was invaluable for complex operational tasks. We could check conditions, make decisions, handle errors, and even interact with external APIs—all within the pipeline.
Plugin ecosystem is massive. There's a Jenkins plugin for everything. We used 43 plugins at our peak, including some really niche ones like the "Slack Notifier with Custom Message Templates" plugin that let us send beautifully formatted build notifications.
Beyond the mainstream plugins, we found some hidden gems:
- Build Timeout Plugin - Automatically killed builds that ran too long. We set a 30-minute timeout for most builds, which caught several infinite loops in test code.
- Timestamper Plugin - Added timestamps to every console log line. Seems trivial, but incredibly useful when debugging slow builds.
- AnsiColor Plugin - Preserved ANSI color codes in console output. Made logs much more readable, especially for test results.
- Lockable Resources Plugin - Prevented concurrent deployments to the same environment. Critical for avoiding race conditions during deployments.
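In a declarative pipeline, most of these show up as one-liners. The sketch below combines them. Note that it uses the Pipeline-native `timeout` option, which plays the same role in a Jenkinsfile as the Build Timeout plugin does for classic jobs. It assumes the Timestamper, AnsiColor, and Lockable Resources plugins are installed, and the resource name is hypothetical.

```groovy
// Jenkinsfile sketch: build timeout, timestamps, colored logs, and a deploy lock.
// 'staging-environment' is a hypothetical lockable resource name.
pipeline {
    agent any
    options {
        timeout(time: 30, unit: 'MINUTES')  // abort builds that run too long
        timestamps()                         // prefix console lines with timestamps
        ansiColor('xterm')                   // render ANSI color codes in the log
    }
    stages {
        stage('Deploy to Staging') {
            steps {
                // Only one build at a time may hold this resource
                lock(resource: 'staging-environment') {
                    sh './deploy-staging.sh'
                }
            }
        }
    }
}
```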
Complete control over the environment. When you're self-hosting, you control everything. Need to install a specific version of Python that's not available in standard images? Just install it. Need to mount a network drive during builds? Configure it once on the agent. Need to access internal services that aren't internet-accessible? Your Jenkins agents are already inside your VPC.
This control was particularly valuable for our security compliance requirements. We had to ensure that build artifacts never left our VPC, that secrets were stored in our internal Vault instance, and that all build logs were retained for audit purposes. With Jenkins running on our own infrastructure, we could implement these requirements without workarounds.
We also leveraged this control for performance optimization. We mounted an EFS volume for shared build caches, which dramatically sped up builds that used the same dependencies. Our Node.js builds could share a common node_modules cache, and our Docker builds shared layer caches. This would have been much harder to replicate on a managed CI service, where caches are typically saved and restored per job rather than shared through a mounted filesystem.
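As a rough illustration of the idea, a Node.js job can point npm at a cache directory on the shared mount. This sketch is an assumption about how such a setup might look, not our exact configuration: the `/mnt/efs/npm-cache` path is hypothetical, and it shares npm's download cache rather than a node_modules directory.

```groovy
// Jenkinsfile sketch: reuse an npm download cache on a shared EFS mount.
// The mount path is a hypothetical example.
pipeline {
    agent { label 'nodejs-18' }
    environment {
        NPM_CACHE_DIR = '/mnt/efs/npm-cache'
    }
    stages {
        stage('Install') {
            steps {
                // --prefer-offline uses cached packages before hitting the registry
                sh 'npm ci --cache "$NPM_CACHE_DIR" --prefer-offline'
            }
        }
    }
}
```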
The Cost Reality of Self-Hosted Jenkins
But let's talk about what Jenkins actually cost us, because the "free and open source" pitch is misleading once you factor in the total cost of ownership.
Infrastructure costs:
- Jenkins master: t3.xlarge ($140/month)
- 3 permanent agents: t3.large ($105/month each = $315/month)
- EBS volumes for build caches: 500GB ($50/month)
- EFS for shared caches: 200GB ($60/month)
- ALB for Jenkins UI: ($23/month)
- Total infrastructure: $588/month
Hidden costs:
- Jake's time (15 hours/week × 4 weeks × $75/hour loaded cost): $4,500/month
- Developer time lost to CI issues (estimated 2 hours/week × 25 developers × 4 weeks × $60/hour): $12,000/month
- Opportunity cost of Jake not working on other infrastructure improvements: Immeasurable but significant
Real monthly cost: ~$17,088
That's not free. That's more expensive than most commercial CI/CD solutions for our team size. And this doesn't even account for the stress and frustration.
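If you want to run the same exercise for your own setup, the arithmetic is easy to script. The sketch below reproduces the infrastructure total and the maintenance line from the figures above, using a four-week month as the simplifying assumption. The `monthlyLabor` helper is a name introduced here for illustration, and other line items can be plugged in the same way.

```groovy
// Plain Groovy sketch of the cost arithmetic (all amounts in USD per month).
def infrastructure = [
    'Jenkins master (t3.xlarge)': 140,
    '3 agents (t3.large)'       : 3 * 105,
    'EBS build caches (500GB)'  : 50,
    'EFS shared caches (200GB)' : 60,
    'ALB for Jenkins UI'        : 23,
]

// Convert a weekly time cost into a monthly figure (assumes a 4-week month)
def monthlyLabor = { hoursPerWeek, people, hourlyRate, weeksPerMonth = 4 ->
    hoursPerWeek * people * hourlyRate * weeksPerMonth
}

def infraTotal = infrastructure.values().sum()
def maintenance = monthlyLabor(15, 1, 75)  // one engineer, 15 hours/week at $75/hour

println "Infrastructure: \$${infraTotal}/month"
println "Maintenance:    \$${maintenance}/month"
println "Subtotal:       \$${infraTotal + maintenance}/month"
```

With the numbers above this prints $588 for infrastructure and $4,500 for maintenance, so the people cost dominates well before any lost developer time is counted.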
The Breaking Point
The final straw came during a critical product launch. We had coordinated a release with Marketing—press releases scheduled, customers notified, the works. At 2 PM on launch day, Jenkins crashed. Not a graceful failure—a complete crash.
Author: Bekzod Erkinov
Founder of NextGenBeing. Software engineer working with Laravel, Python, and cloud infrastructure. Writes about patterns that actually hold up in production. Based in Tashkent, Uzbekistan.