GitHub Actions Self-Hosted Runners on AWS EC2: What No One Tells You

At 2am, our production release pipeline was stuck waiting for a runner. I SSH'd into the EC2 instance and found the runner process had been dead for hours. The disk was full.

TL;DR: Running GitHub Actions self-hosted runners on EC2 without careful setup means hitting three traps: no health monitoring, disk exhaustion, and state contamination between jobs. The real fix is ephemeral runners — one job, one instance, then terminate. This post walks through the failures I hit with a persistent runner and the concrete steps to go ephemeral.

What I Was Trying to Do

At the time, our team's GitHub Actions bill had crossed the equivalent of $350/month. Most of the workload was test suites and Docker image builds — the kind that ran slowly on GitHub-hosted runners (2-core, 7GB) and got more expensive the more we parallelized.

We also needed access to private ECR repositories and internal APIs inside a VPC. With GitHub-hosted runners, maintaining IP allowlists was painful — GitHub publishes hundreds of IP ranges, and they change.

The plan: put a self-hosted runner on a c5.2xlarge (8 vCPU, 16GB RAM), turn variable costs into a fixed EC2 bill. Setup was easy — run config.sh as documented, register it as a systemd service. The first week went fine.

What Went Wrong (and Why)

The runner died because the disk filled up

Self-hosted runners accumulate working files in _work/ with every job. GitHub-hosted runners throw the whole environment away after each run. On a persistent EC2 runner, nothing cleans up automatically. After three weeks, the 30GB root volume was full. The runner process couldn't write logs, and it crashed.

This happened the night of a production release. Jobs queued up with "Waiting for a runner..." indefinitely. I got paged, SSH'd in, manually deleted the directory, and restarted the process. The release went out at 3am.

State contamination made tests fail unpredictably

This one was worse to diagnose. GitHub's own documentation states: self-hosted runners are not guaranteed to operate in a clean environment between jobs. The actions/checkout default (clean: true) does reset the workspace directory with git clean -ffdx. The problem is everything outside the workspace.

Specifically:

~/.docker/config.json — Docker credentials persist across the runner's home directory, so a job on one branch can inherit the auth state from a previous branch's job
/tmp/ — test-generated temp files accumulate and interfere with other branches' test runs
Globally installed tools (npm install -g, pip install, etc.) — versions start mixing in subtle ways

We had one or two tests per week failing in CI but not locally. Each investigation took one to two hours. The culprit was always contamination from outside the workspace. Persistent runners make this hard to reproduce — the state that caused the failure is gone by the time you look.

GitHub acknowledges this risk explicitly and recommends ephemeral runners. A persistent runner "works," but it works without guarantees. That distinction matters at 2am.

There was no monitoring on the runner itself

GitHub's UI shows an "Offline" badge under Settings → Actions → Runners, but you have to go look for it. There are email notifications, but nobody was checking them before a late-night release. The failure mode was: runner goes down, jobs queue silently, someone gets paged when a deployment doesn't finish.
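The same status that drives the "Offline" badge is available from GitHub's REST API, so it can be polled instead of looked up by hand. The sketch below assumes the repository-level endpoint GET /repos/{owner}/{repo}/actions/runners and a token allowed to read runners; the function names and the sample payload are illustrative, not part of the original setup.

```python
import json
import urllib.request


def fetch_runners(owner: str, repo: str, token: str) -> dict:
    """Return the parsed response of the 'list self-hosted runners' endpoint."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/actions/runners",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def offline_runners(payload: dict) -> list:
    """Names of runners whose status is anything other than 'online'."""
    return [
        runner["name"]
        for runner in payload.get("runners", [])
        if runner.get("status") != "online"
    ]


# Hypothetical payload in the shape the endpoint returns
sample = {
    "total_count": 2,
    "runners": [
        {"name": "ec2-runner-1", "status": "online", "busy": False},
        {"name": "ec2-runner-2", "status": "offline", "busy": False},
    ],
}
print(offline_runners(sample))
```

Calling offline_runners(fetch_runners(...)) from a scheduled job and alerting on a non-empty list would have surfaced the dead runner before anyone needed it.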

The Fix — Step by Step

Step 1: Fix disk exhaustion immediately

I added a crontab entry to clean up the _work/ directory and a CloudWatch alarm on disk usage.

/etc/cron.d/github-runner-cleanup

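# Clear _work/ every night at 4am
# (assumes no job is running then; deleting _work/ under a running job breaks it)
0 4 * * * runner /bin/rm -rf /home/runner/actions-runner/_work/*

# Log a warning when root volume exceeds 80%
# (cron has no backslash line continuation, so the whole pipeline stays on one line)
*/15 * * * * runner df / | awk 'NR==2{if($5+0>80) print "DISK WARNING: "$5" used"}' | logger -t disk-check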

I also installed the CloudWatch agent to push disk_used_percent as a metric and set an alarm at 80%. That alone eliminated the midnight crashes.
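For anyone who would rather run the threshold check from a script than from an awk one-liner, here is a rough Python equivalent. It is a sketch, not what I ran: the function names are made up, and shutil.disk_usage computes used/total, which can differ by a few points from the Use% column df prints because df excludes reserved blocks.

```python
import shutil
from typing import Optional


def used_percent(total: int, used: int) -> float:
    """Percentage of the volume in use."""
    return 100.0 * used / total


def disk_warning(path: str = "/", threshold: float = 80.0) -> Optional[str]:
    """Return a warning string when usage on `path` exceeds `threshold`, else None."""
    usage = shutil.disk_usage(path)
    pct = used_percent(usage.total, usage.used)
    if pct > threshold:
        return f"DISK WARNING: {pct:.0f}% used on {path}"
    return None


message = disk_warning("/", threshold=80.0)
if message:
    print(message)
```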

Step 2: Monitor the runner process with CloudWatch

A simple script checks whether the runner service is active and pushes a custom metric:

/usr/local/bin/check-runner-health.sh

#!/bin/bash
# Report 1 if any runner service is active, 0 otherwise.
# The pattern is quoted so the shell hands it to systemctl unexpanded.
if systemctl is-active --quiet 'actions.runner.*.service'; then
  STATUS=1
else
  STATUS=0
fi

aws cloudwatch put-metric-data \
  --namespace GitHubRunner \
  --metric-name RunnerStatus \
  --value "${STATUS}" \
  --unit Count

Run the script on a schedule, for example from cron every minute. When RunnerStatus drops to 0, a CloudWatch alarm fires through SNS to Slack. The EC2 instance's IAM role needs cloudwatch:PutMetricData:

iam-runner-policy.json

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "cloudwatch:PutMetricData",
    "Resource": "*"
  }]
}
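One detail about the alarm is easy to miss: if the whole instance dies, the script stops pushing anything, so the alarm has to treat missing data as a failure too. The sketch below builds parameters for CloudWatch's put_metric_alarm with that setting. The alarm name, the 5-minute period, and the SNS topic ARN are assumptions for illustration; the period assumes the health script runs at least that often. Creating the alarm is done from an admin context, not from the runner's instance role.

```python
def runner_alarm_params(sns_topic_arn: str) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm on the RunnerStatus metric."""
    return {
        "AlarmName": "github-runner-down",   # hypothetical name
        "Namespace": "GitHubRunner",         # matches the health script
        "MetricName": "RunnerStatus",
        "Statistic": "Minimum",              # any 0 in the period counts
        "Period": 300,                       # seconds; assumed reporting window
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",     # no data means the runner is down
        "AlarmActions": [sns_topic_arn],
    }


def create_runner_alarm(sns_topic_arn: str) -> None:
    """Create or update the alarm. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the parameter builder has no AWS dependency

    boto3.client("cloudwatch").put_metric_alarm(**runner_alarm_params(sns_topic_arn))
```

The health script publishes the metric without dimensions, so the alarm deliberately specifies none.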

Step 3: Add post-job cleanup to your workflow

Not a root fix, but reduces contamination between runs:

.github/workflows/build.yml

jobs:
  build:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      # ... build steps ...
 
      # Runs on success and failure
      - name: Cleanup workspace
        if: always()
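        run: |
          # Remove everything in the workspace, including dotfiles;
          # :? aborts the step if the variable is unexpectedly empty
          find "${GITHUB_WORKSPACE:?}" -mindepth 1 -delete
          docker system prune -f --filter "until=24h"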

Step 4: Go ephemeral — the real fix

Everything above is damage control. The real fix is one job, one instance.

GitHub's runner registration supports an --ephemeral flag: the runner deregisters itself after completing a single job. Pair that with EC2 on-demand provisioning and you get a clean environment for every run, no cleanup required.

The minimal architecture is a Lambda function triggered by GitHub's workflow_job webhook. When a job enters the queued state, Lambda starts a new EC2 instance:

github_runner_scaler.py

import boto3
import json
 
ec2 = boto3.client('ec2')
LAUNCH_TEMPLATE_ID = 'lt-xxxxxxxx'  # Launch Template with runner pre-baked
 
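def lambda_handler(event, context):
    body = json.loads(event['body'])
    labels = body.get('workflow_job', {}).get('labels', [])

    # workflow_job events fire for every job, including GitHub-hosted ones,
    # so only launch an instance when the job asks for a self-hosted runner
    if body.get('action') == 'queued' and 'self-hosted' in labels:
        ec2.run_instances(
            LaunchTemplate={'LaunchTemplateId': LAUNCH_TEMPLATE_ID},
            MinCount=1,
            MaxCount=1
        )
        return {'statusCode': 200}

    return {'statusCode': 204}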
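The handler above trusts whatever arrives at its URL. Since anyone who finds the endpoint could otherwise start instances, it is worth checking the webhook signature first. GitHub signs each delivery with HMAC-SHA256 over the raw request body, using the webhook secret, and sends the result in the X-Hub-Signature-256 header as sha256= followed by the hex digest. The sketch below shows only the check; the function name is mine, and how the raw body and headers are extracted depends on the integration (header names may arrive lowercased and the body may be base64-encoded).

```python
import hashlib
import hmac


def signature_is_valid(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Compare GitHub's X-Hub-Signature-256 header against our own HMAC of the body."""
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), raw_body, hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(
        expected.encode("utf-8"), (signature_header or "").encode("utf-8")
    )


# Example with a made-up secret and body
example_body = b'{"action":"queued"}'
example_sig = "sha256=" + hmac.new(b"s3cret", example_body, hashlib.sha256).hexdigest()
print(signature_is_valid("s3cret", example_body, example_sig))
```

In the handler, this would run before json.loads, returning a 401 when the check fails.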

The EC2 instance's User Data registers the runner with --ephemeral, runs the job, then terminates itself:

user-data.sh

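#!/bin/bash
# User Data runs as root, but config.sh and run.sh refuse to run as root by
# default, so both run as the runner user below.
# GITHUB_PAT, OWNER, and REPO are not defined in this snippet; set them before
# this point (for example, by reading them from a secrets store).
set -euo pipefail

# Terminate this instance on exit, whether the job finished or the script failed.
# Uses IMDSv2, which also works where IMDSv1 is disabled. The instance role
# needs ec2:TerminateInstances.
terminate() {
  IMDS="http://169.254.169.254/latest"
  IMDS_TOKEN=$(curl -sX PUT "${IMDS}/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: ${IMDS_TOKEN}" "${IMDS}/meta-data/instance-id")
  REGION=$(curl -s -H "X-aws-ec2-metadata-token: ${IMDS_TOKEN}" "${IMDS}/meta-data/placement/region")
  aws ec2 terminate-instances --region "${REGION}" --instance-ids "${INSTANCE_ID}"
}
trap terminate EXIT

cd /home/runner/actions-runner

# Get a registration token from the GitHub API
TOKEN=$(curl -sX POST \
  -H "Authorization: token ${GITHUB_PAT}" \
  "https://api.github.com/repos/${OWNER}/${REPO}/actions/runners/registration-token" \
  | jq -r .token)

# Register and run in ephemeral mode
sudo -u runner ./config.sh \
  --url "https://github.com/${OWNER}/${REPO}" \
  --token "${TOKEN}" \
  --ephemeral \
  --unattended

sudo -u runner ./run.sh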

After switching to this setup, the unpredictable test failures from state contamination dropped to zero.

What I'd Do Differently

Go ephemeral from day one. A persistent runner is easy to set up. The operational cost catches up with you — disk management, state management, health monitoring all become manual work. Paying the ephemeral setup cost upfront is cheaper than cleaning up the mess later.

Bake dependencies into the AMI. Starting from a plain Ubuntu AMI, it took five to seven minutes from job queue to job start — too long. After baking in Docker, the language runtimes, and the AWS CLI, I got that down to around 90 seconds. The dependencies stay current because I rebuild the AMI when they change, not on every job run.

That said, 90 seconds only makes sense if your jobs are long enough to absorb the overhead. For a 10-minute build, 90 seconds is fine. For a battery of 2-minute unit tests, GitHub-hosted runners — which typically start in 30 to 60 seconds outside peak hours — may give you shorter end-to-end cycle times. Ephemeral EC2 runners shine for heavy builds and jobs that need VPC access. Lightweight tests are often better left on GitHub-hosted.
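The trade-off is plain arithmetic: end-to-end time is startup plus run time, and the bigger instance only wins if the run-time saving exceeds the extra startup. A small sketch, where the speedup factor (how much faster the job runs on the EC2 instance) is an assumed input that has to be measured per workload:

```python
def cycle_seconds(startup_s: float, run_s: float) -> float:
    """End-to-end time from queue to finish."""
    return startup_s + run_s


def ec2_is_faster(hosted_startup_s: float, hosted_run_s: float,
                  ec2_startup_s: float, speedup: float) -> bool:
    """True when the EC2 runner finishes sooner than the GitHub-hosted one."""
    ec2_run_s = hosted_run_s / speedup
    return cycle_seconds(ec2_startup_s, ec2_run_s) < cycle_seconds(hosted_startup_s, hosted_run_s)


def break_even_run_s(hosted_startup_s: float, ec2_startup_s: float, speedup: float) -> float:
    """GitHub-hosted run time at which both options finish at the same moment."""
    if speedup <= 1.0:
        raise ValueError("EC2 never catches up unless the job runs faster there")
    return (ec2_startup_s - hosted_startup_s) / (1.0 - 1.0 / speedup)


# 10-minute build, hypothetical 2x speedup: 90 + 300 = 390s vs 45 + 600 = 645s
print(ec2_is_faster(45, 600, 90, 2.0))
# 2-minute tests, hypothetical 1.2x speedup: 90 + 100 = 190s vs 45 + 120 = 165s
print(ec2_is_faster(45, 120, 90, 1.2))
```

With a 45-second hosted startup, a 90-second EC2 startup, and a 2x speedup, the break-even job length is 90 seconds of hosted run time; anything shorter is better left on GitHub-hosted runners.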

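Design IAM permissions from the start. When a runner needs ECR or S3 access, the temptation is to hand it AdministratorAccess to move fast. I did this, and it meant any code running in a job could do anything in the account. Start with a minimal policy and add permissions as jobs need them.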
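As a starting point, here is a sketch of what "minimal" could look like for a runner that only needs to pull images from one ECR repository. The repository ARN is a placeholder, and a real runner will likely need more than this; the point is to add actions one at a time. ecr:GetAuthorizationToken cannot be scoped to a repository, so it is the only statement with a wildcard resource.

```python
import json


def ecr_pull_policy(repository_arn: str) -> dict:
    """IAM policy document allowing image pulls from a single ECR repository."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Needed for docker login; does not support resource-level scoping
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                # The three actions a docker pull relies on
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": repository_arn,
            },
        ],
    }


# Placeholder account ID and repository name
policy = ecr_pull_policy("arn:aws:ecr:us-east-1:123456789012:repository/example-app")
print(json.dumps(policy, indent=2))
```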

Key Takeaways

Persistent EC2 runners are technical debt deferred, not avoided — go ephemeral from the start if you can
If you run persistent runners, disk monitoring + runner health monitoring + post-job cleanup are non-negotiable
GitHub's workflow_job webhook + Lambda + EC2 is enough to build a lightweight ephemeral scaler
IAM permissions: start minimal, add what you need — don't give a runner AdministratorAccess to move fast
Bake dependencies into your AMI; reinstalling them on every job is slow and unnecessary

FAQ

Q: Are self-hosted runners cheaper than GitHub-hosted runners?

A: It depends on your usage pattern and how you run the runners. The comparison isn't straightforward, so here's the framework rather than a single number.

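For ephemeral runners (the setup described in this post): once your team exceeds the free included minutes in your plan, each additional minute on EC2 costs roughly a third of the GitHub-hosted overage rate. If you keep spending the free minutes on GitHub-hosted runners and send only the overflow to EC2, ephemeral EC2 is cheaper per minute as soon as you're past the free tier, so the break-even point is approximately your plan's monthly free minute allowance. If you move every job to EC2 instead, you give up the free minutes, and at a third of the rate the break-even moves out to roughly 1.5 times the allowance.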

For a persistent EC2 instance running around the clock: the math is much less favorable. You're paying for the instance whether it's processing jobs or idle, so you need a very high utilization rate to come out ahead.

For exact current rates, check the GitHub Actions billing docs and AWS EC2 pricing for your region. Factor in operational overhead (monitoring, maintenance, AMI rebuilds) and the math shifts further — for small teams, GitHub-hosted is often cheaper in total.
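To run the comparison with your own numbers, the framework reduces to a few formulas. The rates and the free allowance in the example calls are hypothetical placeholders chosen to match the "roughly a third" ratio above, not current prices, and the sketch ignores EBS, data transfer, and the minutes an ephemeral instance spends booting.

```python
def github_hosted_cost(minutes: float, free_minutes: float, rate_per_minute: float) -> float:
    """Monthly overage cost when every job runs on GitHub-hosted runners."""
    return max(0.0, minutes - free_minutes) * rate_per_minute


def ephemeral_ec2_cost(minutes: float, hourly_rate: float) -> float:
    """Monthly cost when instances exist only while jobs run."""
    return minutes / 60.0 * hourly_rate


def persistent_ec2_cost(hourly_rate: float, hours_in_month: float = 730.0) -> float:
    """Monthly cost of one instance running around the clock."""
    return hourly_rate * hours_in_month


def persistent_break_even_utilization(hourly_rate: float, rate_per_minute: float) -> float:
    """Fraction of the month a persistent instance must be busy to match per-minute billing."""
    return hourly_rate / (rate_per_minute * 60.0)


def all_in_break_even_minutes(free_minutes: float, rate_per_minute: float,
                              hourly_rate: float) -> float:
    """Monthly minutes at which moving every job to ephemeral EC2 costs the same as staying."""
    return free_minutes * rate_per_minute / (rate_per_minute - hourly_rate / 60.0)


# Hypothetical inputs: 5,000 minutes/month, 3,000 free, $0.009/min hosted, $0.18/hour EC2
print(github_hosted_cost(5000, 3000, 0.009))
print(ephemeral_ec2_cost(5000, 0.18))
print(persistent_ec2_cost(0.18))
```

With these placeholder rates, a persistent instance has to be busy about a third of every hour in the month just to match per-minute billing, which is why the always-on option needs very high utilization to pay off.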

Q: Can self-hosted runners on EC2 reach private VPC resources like RDS or internal APIs?

A: Yes. Place the EC2 instance in a private subnet and open the necessary ports in the security group. The runner only makes outbound HTTPS connections to GitHub, so it needs an egress path such as a NAT gateway but no inbound rules. It gets native VPC connectivity with no IP allowlists and no tunneling. This is one of the strongest reasons to use self-hosted runners; GitHub-hosted runners can't do this cleanly.

Q: Ephemeral runner startup takes too long. What can I do?

A: Start by baking your dependencies into the AMI. A plain Ubuntu AMI takes five to seven minutes from queue to job start; a pre-baked AMI with Docker, language runtimes, and tooling gets that to one to two minutes. For tighter startup requirements, EC2 Auto Scaling warm pools, which keep pre-initialized instances in a stopped state so they skip first-boot setup, can cut cold-start latency further. They require launching through an Auto Scaling group rather than a direct run_instances call. If startup time is still the problem after that, it's worth asking whether those jobs actually need a self-hosted runner. Short, lightweight jobs often run with lower total cycle time on GitHub-hosted runners.

Q: What are the security risks with self-hosted runners?

A: The main risk is that the runner becomes a foothold into your AWS environment. Code running on the runner — including code from pull requests — has access to whatever that EC2 instance's IAM role can do. Three mitigations: (1) keep IAM permissions minimal, (2) avoid using self-hosted runners on public repositories, or at minimum require approval before running workflows from forks, (3) use ephemeral runners so each job starts from a clean state with no residue from previous runs.

Q: Is there an alternative to the Lambda + EC2 approach for scaling ephemeral runners?

A: Yes. Actions Runner Controller (ARC) is the more complete solution — it runs runners on Kubernetes and handles autoscaling, and GitHub maintains it. If you already run EKS, ARC is worth evaluating. If you don't, standing up EKS just for ARC is significant overhead. The Lambda + EC2 approach covers most small-to-medium team needs with far less infrastructure to manage.

This post is a reconstruction based on experience across multiple organizations. Details that could identify specific companies or individuals have been omitted or generalized.
