How to Set Up Zero-Downtime Deployment on AWS ECS

When I enabled ECS rolling updates, I thought I'd finally nailed zero-downtime deploys. What I actually got was 5–10 seconds of scattered 502s on every deploy. I only noticed because Datadog's error rate graph told me.

TL;DR: ECS rolling update doesn't give you true zero downtime out of the box. You need three things aligned: ALB deregistration delay, app-level graceful shutdown, and ECS stopTimeout. I had to fix my deploy process three times before production deploys stopped generating 502s.

What I Was Trying to Do

I was migrating a SaaS backend from EC2 to ECS. The reason was simple: I wanted "build and push a Docker image = deploy done." Before ECS, we had a 10-page runbook for copying JARs onto EC2 servers by hand.

I configured the ECS service with rolling update, Minimum healthy percent: 50, Maximum percent: 200. Half the tasks always running during deploys — zero downtime, right?

Wrong.

What Went Wrong (and Why)

The problem had three layers.

Layer 1: The App Was Ignoring SIGTERM

When ECS stops a task, it sends SIGTERM first. If your app catches SIGTERM and finishes in-flight requests before exiting, connections close cleanly.

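Our app was a Java Spring Boot service with no graceful shutdown configured. On SIGTERM, the JVM starts shutting down right away and the embedded web server stops without waiting for in-flight requests. Every request that was mid-flight got cut off, no questions asked.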

Layer 2: stopTimeout Was Too Short

After sending SIGTERM, ECS waits stopTimeout seconds before sending SIGKILL. Default: 30 seconds. Meanwhile, ALB's deregistration delay (how long the ALB keeps a deregistering target in the draining state so in-flight requests can complete) defaults to 300 seconds.

300 > 30. ECS killed the task while ALB was still draining it. Requests that arrived during draining hit a dead task — 502.

Layer 3: No Health Check Grace Period

New tasks took 10–15 seconds to initialize (Spring Boot startup time). With no healthCheckGracePeriodSeconds set, ECS started health checking immediately after task launch, saw failures, and tried to replace the task — a loop that made things worse.

The Fix — Step by Step

I fixed the three problems in order.

Step 1: Add Graceful Shutdown to the App

For Spring Boot, setting server.shutdown=graceful makes the embedded web server stop accepting new requests when the application shuts down on SIGTERM, and exit once in-flight requests finish or the shutdown-phase timeout expires.

src/main/resources/application.yml

server:
  shutdown: graceful
 
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s

Set timeout-per-shutdown-phase based on your longest request processing time. Our p99 response time was under 3 seconds, so 30 seconds gave plenty of headroom.

Step 2: Tune the ALB Deregistration Delay

Deregistration delay gives in-flight requests time to finish after the ALB starts deregistering the target. The default of 300 seconds is longer than most request/response services need.

How to pick the value:

It must be longer than your app's graceful shutdown timeout
Our setting: graceful shutdown 30s + 10s buffer = 40 seconds

In CloudFormation:

cloudformation/service.yml

MyTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: "40"

Step 3: Set ECS stopTimeout

I set stopTimeout to deregistration delay (40s) + graceful shutdown (30s) + buffer (10s) = 80 seconds.

cloudformation/service.yml

# CloudFormation: Task Definition
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    ContainerDefinitions:
      - Name: app
        StopTimeout: 80

Here's the sequence that now happens on every deploy:

ECS sends SIGTERM
App starts graceful shutdown (stops accepting new requests, finishes in-flight)
ALB starts deregistration (40-second drain)
After 40s: ALB drain completes
App exits
ECS marks task stopped
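
The arithmetic from Steps 1 to 3 can be sketched as a small Python helper. This is a minimal sketch, assuming the same 10-second buffer used above; ShutdownTimers and derive_timers are hypothetical names made up for illustration and are not part of any AWS tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownTimers:
    """The three timers discussed in this article, in seconds."""
    graceful_shutdown_s: int
    deregistration_delay_s: int
    stop_timeout_s: int


def derive_timers(graceful_shutdown_s: int, buffer_s: int = 10) -> ShutdownTimers:
    """Derive the other two timers from the app's graceful shutdown timeout.

    Assumption: the buffer is applied once per layer, as in Steps 2 and 3.
    """
    if graceful_shutdown_s <= 0 or buffer_s < 0:
        raise ValueError("timeout must be positive and buffer non-negative")
    # Step 2: deregistration delay = graceful shutdown + buffer
    deregistration_delay_s = graceful_shutdown_s + buffer_s
    # Step 3: stopTimeout = deregistration delay + graceful shutdown + buffer
    stop_timeout_s = deregistration_delay_s + graceful_shutdown_s + buffer_s
    return ShutdownTimers(graceful_shutdown_s, deregistration_delay_s, stop_timeout_s)


print(derive_timers(30))
# ShutdownTimers(graceful_shutdown_s=30, deregistration_delay_s=40, stop_timeout_s=80)
```

Starting from a single input keeps the three values from drifting apart when one of them is retuned later.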

Step 4: Set the Health Check Grace Period

cloudformation/service.yml

Service:
  Type: AWS::ECS::Service
  Properties:
    HealthCheckGracePeriodSeconds: 60
    DeploymentConfiguration:
      MinimumHealthyPercent: 50
      MaximumPercent: 200
      DeploymentCircuitBreaker:
        Enable: true
        Rollback: true

HealthCheckGracePeriodSeconds: 60 tells ECS to ignore health check failures for the first 60 seconds after task launch. With Spring Boot taking up to 15 seconds to initialize, 60 seconds gives it room to breathe.

DeploymentCircuitBreaker automatically rolls back if tasks keep failing during a deploy. Without it, you can push a broken image and watch ECS replace every healthy task with broken ones before anyone notices.

Step 5: Measure It

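After all four changes, I deployed to production and watched HTTPCode_Target_5XX_Count in ALB metrics. Before: dozens of 502s per deploy. After: zero, consistently. Note that 502s generated by the load balancer itself are reported under HTTPCode_ELB_502_Count, so that metric is worth watching too.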

What I'd Do Differently

Build graceful shutdown into the app from day one.

Infrastructure settings — deregistration delay, stopTimeout — are easy to tune later. App-level SIGTERM handling isn't. Once you have multiple microservices, retrofitting graceful shutdown across all of them is real work. Build it in early, when the cost is low.

Build a culture of measuring deploy behavior in staging.

I found the 502s in production, not staging. If we'd been tracking error rates during staging deploys from the start, we would have caught this before it ever touched users. Treating deploy correctness as a metric — not just a checkbox — is the right habit.

Key Takeaways

Zero-downtime on ECS is about aligning three timers correctly.

The three values have a clear ordering: stopTimeout must be greater than the sum of deregistration delay and graceful shutdown timeout. Break that ordering anywhere and you get forced disconnections — and 502s.

Setting	Purpose	Our value
Graceful shutdown timeout	Time for the app to finish in-flight requests	30s
ALB deregistration delay	Time for ALB draining to complete	40s
ECS stopTimeout	Time before ECS sends SIGKILL	80s
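
The ordering rule can also be written as a check that runs before a deploy. The following is a minimal Python sketch under the rules stated in this article; timer_violations is a hypothetical helper name, and the messages are illustrative.

```python
def timer_violations(graceful_shutdown_s, deregistration_delay_s, stop_timeout_s):
    """Return a list of ordering problems; an empty list means the timers line up."""
    problems = []
    # Step 2 rule: the drain window must outlast the app's graceful shutdown.
    if deregistration_delay_s <= graceful_shutdown_s:
        problems.append(
            "deregistration delay should be longer than the graceful shutdown timeout"
        )
    # Key takeaway: stopTimeout must exceed the sum of the other two timers.
    if stop_timeout_s <= deregistration_delay_s + graceful_shutdown_s:
        problems.append(
            "stopTimeout should be greater than deregistration delay + graceful shutdown timeout"
        )
    return problems


# The values from the table above pass.
print(timer_violations(30, 40, 80))  # []

# The defaults (300s deregistration delay, 30s stopTimeout) break the ordering.
print(len(timer_violations(30, 300, 30)))  # 1
```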

FAQ

Q: How long should I set the ECS graceful shutdown timeout?

A: Start with your p99 response time, then multiply by 2–3x. If your longest requests take 5 seconds, 15–30 seconds is a reasonable target. Don't set it too high — a longer timeout means longer deploys. Check your actual response time distribution before deciding.

Q: Does this work for Node.js or Go, not just Spring Boot?

A: The ALB deregistration delay and ECS stopTimeout settings are language-agnostic. The app-level graceful shutdown logic varies, but the idea is the same everywhere: catch SIGTERM, stop accepting new connections, wait for in-flight requests to finish, then exit. In Node.js: process.on('SIGTERM', ...). In Go: signal.NotifyContext. The pattern transfers.
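
To show that the pattern really is language-agnostic, here is the same four-step idea as a minimal Python sketch using only the standard library. RequestTracker, install_sigterm_handler, and graceful_shutdown are hypothetical names for illustration; a real service would wire try_begin and end into its request handling.

```python
import signal
import threading


class RequestTracker:
    """Counts in-flight requests so shutdown can wait for them to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._accepting = True

    def try_begin(self):
        """Return True if a new request may start, False once draining began."""
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end(self):
        """Mark one request as finished and wake up a waiting drain."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def stop_accepting(self):
        with self._cond:
            self._accepting = False

    def drain(self, timeout_seconds):
        """Wait until nothing is in flight; returns False if the timeout hit first."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout_seconds
            )


shutdown_requested = threading.Event()


def install_sigterm_handler():
    """Step 1: catch SIGTERM. The handler only sets a flag (call from the main thread)."""
    signal.signal(signal.SIGTERM, lambda signum, frame: shutdown_requested.set())


def graceful_shutdown(tracker, timeout_seconds):
    """Steps 2 and 3: stop accepting, then wait for in-flight requests."""
    tracker.stop_accepting()
    return tracker.drain(timeout_seconds)
```

The main thread would call install_sigterm_handler(), block on shutdown_requested.wait(), then call graceful_shutdown(tracker, 30) and exit (step 4). Keeping the signal handler down to setting a flag avoids doing real work inside the handler itself.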

Q: What triggers the ECS deployment circuit breaker?

A: The failure threshold is calculated as ceil(0.5 × desired task count), with a minimum of 3 and a maximum of 200. With 1–6 desired tasks, it fires after 3 failures. With 25 tasks, it fires after 13. Even small services can hit the minimum of 3 quickly — which is exactly why pairing DeploymentCircuitBreaker with HealthCheckGracePeriodSeconds matters. Without the grace period, normal startup time can count as failures and trigger rollback before the app is even ready.
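
The threshold formula above is easy to check with a few lines of Python. This is a sketch of the calculation as described in this answer, with a hypothetical function name; it does not call any AWS API.

```python
import math


def circuit_breaker_threshold(desired_count: int) -> int:
    """Failure threshold: ceil(0.5 x desired count), clamped to the range 3..200."""
    half = math.ceil(0.5 * desired_count)
    return min(max(half, 3), 200)


for desired in (1, 6, 7, 25, 1000):
    print(desired, circuit_breaker_threshold(desired))
# 1 3
# 6 3
# 7 4
# 25 13
# 1000 200
```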

Q: Are there differences between Fargate and EC2 launch types?

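A: The core settings (deregistration delay, stopTimeout, graceful shutdown) work the same way on both. One difference is the stopTimeout ceiling: on Fargate the maximum is 120 seconds, while on the EC2 launch type an unspecified stopTimeout falls back to the container agent's ECS_CONTAINER_STOP_TIMEOUT setting. For workloads that need longer shutdown windows, such as batch jobs or long-running transactions, consider ECS Run Task instead of a long-running service.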

This article draws on experience working as an SRE across multiple organizations. Some descriptions — including specific timelines, team conversations, and decision-making moments — are reconstructed from memory and are not verbatim records. Information that could identify specific companies or individuals has been omitted or generalized.
