The Wrong SLI Almost Broke Our Reliability Culture
Our SLO was green while support was on fire. The real issue wasn't the SLI; it was using the SLO as a report card. Here's how we fixed both.
Production incidents, AWS deep-dives, and CI/CD battle stories.
Burned out on managing self-hosted runners on EC2, I switched to CodeBuild-managed runners. Here's the full setup, including the webhook and IAM gotchas that cost me a day.
Three months piloting EKS alongside ECS in production. What the upgrade overhead costs, what broke, and a four-question framework for the decision.
We hit CloudFormation's 500-resource hard limit mid-migration. Here's what broke, how we fixed it, and when to choose each tool.
CI pipelines slow down for four reasons: missing cache, sequential jobs, no path filtering, and broken Docker layer cache. I diagnosed a 32-minute pipeline and cut it down to about 15 minutes.
I put a self-hosted runner on EC2 and it died at 2am. Here's what broke, why non-ephemeral runners are a trap, and the step-by-step path to a production-ready setup.
My Spring Boot Docker image hit 1.2 GB. CI took 12 minutes per run and Trivy flagged 140 vulnerabilities. Multi-stage builds brought it down to 245 MB — here's exactly what I changed.
ECS rolling update defaults don't give you zero downtime. Here's the three-layer fix — graceful shutdown, ALB deregistration delay, and stopTimeout — that ended our deploy-time 502s.
I ran Jenkins for years before switching to GitHub Actions. We saw a 20% drop in release work; here's the reasoning I actually used.