Docker Multi-Stage Builds: Cut Image Size by 80%
My CI pipeline was taking 12 minutes per run. I assumed GitHub Actions runners were just busy. When I actually measured it, 80% of the time was eaten by the Docker push step. I checked the image size: 1.2 GB.
TL;DR: A Spring Boot image built with FROM openjdk:17 will balloon to over 1 GB. You only need a JDK to compile — at runtime, a JRE is enough. Splitting the Dockerfile into a JDK build stage and a JRE runtime stage brought mine from 1.2 GB down to 245 MB. CI time dropped from 12 minutes to 4. Trivy alerts went from 140 to 18.
What I Was Trying to Do
I was running a pipeline that built a Spring Boot Docker image in GitHub Actions, pushed it to ECR, and deployed to ECS.
My original Dockerfile looked like this:
FROM openjdk:17
WORKDIR /app
COPY . .
RUN ./gradlew bootJar
CMD ["java", "-jar", "build/libs/app.jar"]
It worked. Tests passed. Deploys succeeded. I left it alone for a while.
The problem surfaced when I looked at CI duration metrics on a Datadog dashboard. One week it quietly crossed 12 minutes. I dug into the logs and found the push step: "Pushing 1.2GB..."
What Went Wrong (and Why)
Why the Image Grew to 1.2 GB
openjdk:17 is a base image that includes the full JDK (Java Development Kit). That means javac, jdb, and a bunch of development tooling. Layered on top of a full Linux distribution (the default openjdk:17 tag is built on Oracle Linux rather than a slim base), the base image alone exceeds 600 MB. Add Gradle's cache and the compiled application, and you're at 1.2 GB.
The thing is, you don't need a JDK to run a Java app. A compiled JAR only needs a JRE — Java Runtime Environment. I was shipping a compiler to production because it happened to work the first time and I never questioned it.
The Security Scanner Made It Impossible to Ignore
When I added Trivy to the CI pipeline, the same image that "worked" suddenly had 140 vulnerability alerts. The JDK tools that came along for the ride — compilers, debuggers, dev utilities — are common CVE targets. There's no operational reason to have them in a production runtime.
The Fix — Step by Step
What Multi-Stage Builds Actually Do
Docker's multi-stage build feature lets you use multiple FROM instructions in a single Dockerfile. Each FROM starts a new stage, and a later stage can copy files out of any earlier stage with COPY --from. The key part: only what you explicitly copy forward ends up in the final image. Build tools stay behind.
For Spring Boot, the approach is straightforward:
- Builder stage: Use a JDK image to run Gradle and produce the fat JAR
- Runtime stage: Copy only the JAR into a lightweight JRE image
# ---- Stage 1: Build ----
FROM eclipse-temurin:17-jdk-alpine AS builder
WORKDIR /app
COPY . .
RUN --mount=type=cache,target=/root/.gradle \
    ./gradlew bootJar --no-daemon
# ---- Stage 2: Runtime ----
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
COPY --from=builder /app/build/libs/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
I chose eclipse-temurin because the openjdk official image on Docker Hub stopped being updated in July 2022 and was officially deprecated in December of that year. Eclipse Temurin (from the Adoptium project) is built from the same OpenJDK source and is the recognized successor. The -alpine suffix switches the base OS to Alpine Linux, trimming the image further.
Managing Gradle's Cache with BuildKit
A common pattern you'll see in Dockerfiles is splitting COPY commands — dependency files first, then source code — to create a cacheable layer. This works well for Node.js: copy package.json, run npm install, then copy the rest. If package.json hasn't changed, Docker reuses the layer.
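As a rough sketch of that pattern, here is what the split looks like for a hypothetical Node.js service. The node:20-alpine tag, the server.js entry point, and the presence of a package-lock.json are assumptions for illustration, not details from my pipeline.

```dockerfile
# Hypothetical Node.js service: image tag and file names are assumptions.
FROM node:20-alpine
WORKDIR /app

# Copy only the dependency manifests first, so this layer and the
# install layer below are rebuilt only when the manifests change.
COPY package.json package-lock.json ./
RUN npm ci

# Source edits invalidate the cache from this instruction onward,
# leaving the installed node_modules layer untouched.
COPY . .
CMD ["node", "server.js"]
```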
Gradle is trickier. The intuitive move is to run ./gradlew dependencies first, expecting it to download JARs into a cache layer. It doesn't. The dependencies task is a reporting task — it prints the dependency tree but doesn't download artifacts to disk. I wasted an afternoon on this before checking the Gradle docs.
The right tool here is BuildKit's --mount=type=cache:
RUN --mount=type=cache,target=/root/.gradle \
    ./gradlew bootJar --no-daemon
This persists the Gradle local cache directory across builds, outside of image layers entirely. On the same host, subsequent builds reuse whatever Gradle already downloaded. BuildKit has been the default builder on Linux since Docker Engine 23.0, so no extra setup is needed there. The cache lives in the builder's local storage, so on ephemeral CI runners it starts empty unless that storage is persisted between runs. Local rebuild times dropped dramatically after this change.
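If you want the cache mount spelled out more explicitly, here is a minimal sketch of the same builder stage. The syntax directive pins the Dockerfile frontend, and id and sharing are optional settings of the cache mount. The id value gradle-cache is a name made up for illustration, not something from my original setup.

```dockerfile
# syntax=docker/dockerfile:1
FROM eclipse-temurin:17-jdk-alpine AS builder
WORKDIR /app
COPY . .

# id names the cache (gradle-cache is a hypothetical name) so other
# builds on the same builder can refer to the same cache directory.
# sharing=locked makes concurrent builds wait for each other instead
# of writing to the Gradle cache at the same time.
RUN --mount=type=cache,id=gradle-cache,target=/root/.gradle,sharing=locked \
    ./gradlew bootJar --no-daemon
```

The mounted directory exists only while the RUN instruction executes, so nothing under /root/.gradle ends up in an image layer.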
Results
| Metric | Before | After |
|---|---|---|
| Image size | 1.2 GB | 245 MB |
| CI time (including push) | 12 min | 4 min |
| Trivy alerts | 140 | 18 |
The 245 MB breaks down as: JRE-alpine (~190 MB) + Spring Boot fat JAR (~55 MB).
What I'd Do Differently
I would have started with multi-stage from day one.
A Dockerfile that just works stays invisible until something forces you to look at it — a slow CI pipeline, a security scan, a cost review. Rewriting it isn't hard, but the longer you wait, the more services get cloned from the same template. I fixed the Spring Boot app and then found three more services built the same way. Four rewrites instead of one.
I would have set base image standards upfront.
openjdk:17 stuck around because the team had no written criteria for choosing base images. After this, we put it in the runbook: check maintenance status, prefer alpine variants, evaluate distroless for non-interactive workloads. That conversation takes 30 minutes. Not having it cost us hours of cleanup.
Key Takeaways
Multi-stage builds solve two problems at once.
- Image size: Build tools — JDK, Node.js devDependencies, Rust toolchains — never need to reach production. Multi-stage makes that the default rather than something you have to consciously strip out.
- Security: Fewer packages in the runtime image means fewer CVE targets. The 140 → 18 Trivy alert reduction wasn't from patching anything. It was from shipping less.
One thing worth knowing: --mount=type=cache is the reliable way to cache Gradle dependencies across Docker builds on the same host. The pattern of splitting COPY commands and running ./gradlew dependencies is widely repeated, but dependencies is a reporting task and doesn't actually download the JARs.
FAQ
What is a Docker multi-stage build?
A multi-stage build uses multiple FROM instructions in a single Dockerfile. A later stage can copy files from any earlier stage with COPY --from, but only what you explicitly copy forward ends up in the final image. Build tools, compilers, and intermediate artifacts stay behind.
How much can multi-stage builds reduce Docker image size?
It depends on the language and base images involved. Switching from a JDK image to a JRE image on Alpine typically reduces size by 45–65% (based on docker-library/repo-info measurements, which change with each minor JDK release — treat these as reference figures, not fixed numbers). Node.js sees similar reductions when you run npm install in a build stage and copy only production node_modules to the runtime stage.
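A minimal sketch of that Node.js variant might look like the following. It assumes a project with a build script that writes to dist/ and an entry point at dist/server.js. Both are hypothetical, as are the stage names and the node:20-alpine tag.

```dockerfile
# Stage 1: install all dependencies (including devDependencies) and build.
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
# Assumes package.json defines a "build" script that emits dist/.
RUN npm run build

# Stage 2: install production dependencies only.
FROM node:20-alpine AS prod-deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Stage 3: runtime image with the build output and production modules.
FROM node:20-alpine
WORKDIR /app
COPY --from=prod-deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
CMD ["node", "dist/server.js"]
```

The final stage copies from two different earlier stages, so neither the devDependencies nor the source tree reach the runtime image.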
Are Alpine (-alpine) base images secure?
Alpine Linux uses musl libc and BusyBox, which means fewer installed packages than Debian or Ubuntu-based images. Fewer packages means a smaller attack surface — that part is real. That said, Trivy often reports fewer CVEs against Alpine images partly because the Alpine Security DB only tracks fixed vulnerabilities by default. The behavior varies by Trivy version and configuration, so check the docs for the version you're running rather than treating a low alert count as a security guarantee. One practical caveat: binaries linked against glibc — some JDBC drivers, for instance — won't run on musl. Test your app on Alpine before committing to it.
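One way to keep that test cheap is to make the runtime base image a build argument, so the same Dockerfile can produce an Alpine image or a glibc-based one. This is a sketch under assumptions: RUNTIME_IMAGE is a hypothetical argument name, and eclipse-temurin:17-jre (the non-Alpine tag) stands in as the glibc fallback.

```dockerfile
# RUNTIME_IMAGE is a hypothetical build argument. Declaring it before the
# first FROM makes it usable in FROM lines further down.
ARG RUNTIME_IMAGE=eclipse-temurin:17-jre-alpine

FROM eclipse-temurin:17-jdk-alpine AS builder
WORKDIR /app
COPY . .
RUN ./gradlew bootJar --no-daemon

# Defaults to Alpine (musl); override at build time for a glibc base.
FROM ${RUNTIME_IMAGE}
WORKDIR /app
COPY --from=builder /app/build/libs/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
```

Building with --build-arg RUNTIME_IMAGE=eclipse-temurin:17-jre swaps in the glibc-based image without touching the Dockerfile, which makes it easy to run the same test suite against both variants.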
What's the difference between eclipse-temurin and openjdk?
Both are built from the same OpenJDK source and are functionally equivalent for most use cases. The difference is who builds and maintains them. The openjdk image on Docker Hub stopped receiving updates in July 2022 and was officially deprecated on December 20, 2022 (docker-library/openjdk issue #505). Eclipse Temurin, maintained by the Eclipse Adoptium project, has passed Java SE TCK certification and provides guaranteed LTS support. For any new project today, eclipse-temurin is the standard choice.
Is Google's distroless image smaller than Alpine?
No — Alpine is smaller. Google's distroless Java images are Debian-based, which makes them larger than eclipse-temurin:17-jre-alpine. The advantage of distroless isn't size — it's that there's no shell, no package manager, and no other tooling that could be exploited at runtime. That's a meaningful security difference for production workloads that don't need interactive access. The tradeoff is that you can't exec into the container to inspect state (the :debug tag includes a BusyBox shell, but that's for debugging only). If you're considering distroless, check the official GitHub (github.com/GoogleContainerTools/distroless) for the currently active tag — it's a fast-moving project and tag names change.
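For comparison, a distroless runtime stage could be sketched like this. The gcr.io/distroless/java17-debian12 tag is an assumption on my part, so confirm the current name in the distroless repository before using it.

```dockerfile
FROM eclipse-temurin:17-jdk-alpine AS builder
WORKDIR /app
COPY . .
RUN ./gradlew bootJar --no-daemon

# Tag is an assumption; check the distroless repository for the
# currently published Java image names.
FROM gcr.io/distroless/java17-debian12
WORKDIR /app
COPY --from=builder /app/build/libs/*.jar app.jar
EXPOSE 8080
# The distroless Java images already set an entrypoint that runs
# "java -jar", so CMD supplies only the path to the JAR.
CMD ["app.jar"]
```

There is no shell in the final stage, so any RUN instruction placed after the second FROM would fail. Everything that needs a shell has to happen in the builder stage.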
This post is based on my experience working as an SRE across multiple organizations. Some details — timelines, internal conversations, decision-making moments — are reconstructed from that experience and are not verbatim records. Any information that could identify a specific company or individual has been omitted or generalized.