Securing CI/CD Agents with Ephemeral Runners and Least Privilege

The Continuous Integration/Continuous Deployment (CI/CD) pipeline is the heartbeat of modern software engineering: the automated engine that transforms source code into production-ready artifacts. However, this automation introduces a large, often overlooked attack surface. If an attacker can compromise a build process (via a malicious pull request, a dependency confusion attack, or a compromised third-party plugin), they don't just compromise a single application; they compromise the very factory that builds all your applications.

In the traditional CI/CD model, build agents (runners) are often long-lived, static virtual machines or persistent containers. These "pet" runners are a security nightmare. To mitigate this, we must pivot toward a security architecture built on two fundamental pillars: Ephemeral Runners and Least Privilege via Identity Federation.

The Peril of Persistent Runners

The primary danger of static build agents is persistence. A long-lived runner retains state across many unrelated builds, so an attacker who gains code execution during a single compromised build can perform several high-impact maneuvers:

  1. Environment Poisoning: An attacker can modify system-level binaries (e.g., replacing `gcc` or `npm` with a wrapper that injects backdoors), alter `PATH` variables, or pollute local caches (like `~/.npm` or `~/.m2`).
  2. Lateral Movement: A persistent runner often holds long-lived credentials (IAM keys, SSH keys, or API tokens) in its environment or filesystem. Once an attacker gains access to the runner, they can exfiltrate these secrets to move laterally into your cloud infrastructure.
  3. Information Leakage: Residual data from previous builds (such as build logs, debug artifacts, or decrypted secrets) may remain on the disk, accessible to subsequent, potentially unauthorized, build jobs.

By treating runners as "cattle" rather than "pets," we eliminate the possibility of an attacker establishing a permanent foothold within the build infrastructure.
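
Where persistent runners cannot be retired immediately, environment poisoning can at least be detected. The following is a minimal sketch, assuming a baseline of tool digests is recorded while the runner is in a known-good state; the function names are hypothetical, and the list of paths to watch is left to the operator.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths: List[Path]) -> Dict[str, str]:
    """Record a digest for every existing file; this is the known-good baseline."""
    return {str(p): sha256_of(p) for p in paths if p.is_file()}


def find_drift(baseline: Dict[str, str]) -> List[str]:
    """Return the paths whose content changed or vanished since the baseline."""
    drifted = []
    for name, expected in baseline.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != expected:
            drifted.append(name)
    return sorted(drifted)
```

A check like this only narrows the detection gap. It does not remove the foothold, which is why the rest of this article argues for discarding the runner instead.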

Pillar 1: The Ephemeral Runner Pattern

An ephemeral runner is a compute instance that is instantiated for a single job and destroyed immediately upon completion. This can be achieved using Kubernetes Pods (for example, via GitLab Runner's Kubernetes executor), AWS Fargate, or GitHub Actions' hosted runners.

The Mechanics of Ephemerality

In an ephemeral architecture, the lifecycle of the runner follows a strict pattern:

  1. Provisioning: A fresh, hardened container or VM is spun up from a known-good image.
  2. Execution: The CI/CD orchestrator dispatches the job.
  3. Termination: The instance is terminated, and its entire filesystem is wiped.

This "clean slate" approach effectively mitigates Environment Poisoning. Even if a malicious dependency successfully executes code during the build, its modifications vanish the moment the job finishes. The attacker's window is limited to the lifetime of that one job, which significantly shrinks the "blast radius."
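
The provision, execute, terminate lifecycle can be sketched in a few lines. This is an illustration only: a temporary directory stands in for the fresh container or VM, `run_ephemeral` and both job functions are hypothetical names, and a real runner relies on the orchestrator for process and network isolation, which a directory does not provide.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_ephemeral(job: Callable[[Path], T]) -> T:
    """Provision a fresh workspace, run exactly one job in it, then destroy it."""
    workspace = Path(tempfile.mkdtemp(prefix="runner-"))  # 1. Provisioning
    try:
        return job(workspace)  # 2. Execution
    finally:
        # 3. Termination: runs even if the job raised an exception.
        shutil.rmtree(workspace, ignore_errors=True)


def poisoning_job(workspace: Path) -> Path:
    """Simulate a compromised build that drops a fake tool wrapper."""
    (workspace / "npm").write_text("#!/bin/sh\n# malicious wrapper\n")
    return workspace


def honest_job(workspace: Path) -> List[str]:
    """List whatever a later job finds in its workspace."""
    return sorted(p.name for p in workspace.iterdir())
```

Running `poisoning_job` and then `honest_job` through `run_ephemeral` gives the second job an empty workspace, because the first workspace no longer exists by the time it starts.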

Implementation Consideration: Image Hardening

The security of ephemeral runners is only as good as the base image used to launch them. You must move away from "kitchen sink" images that contain compilers, network tools (`curl`, `netcat`), and cloud CLIs by default. Instead, use distroless or highly stripped-down images. If a build requires `terraform`, only include the Terraform binary and its minimal dependencies.
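
One way to keep an image honest is to audit it for tools it should not ship. The sketch below assumes a POSIX image and a hypothetical denylist; apart from `curl` and `netcat`, the tool names are illustrative, and the right list depends on what each build genuinely needs.

```python
import shutil
from typing import Iterable, List, Optional

# Hypothetical denylist: tools a minimal build image is assumed not to need.
DENYLIST = ("curl", "wget", "nc", "netcat", "ssh", "aws")


def unexpected_tools(
    denylist: Iterable[str] = DENYLIST,
    search_path: Optional[str] = None,
) -> List[str]:
    """Return denylisted tools that resolve to an executable.

    search_path uses the same format as PATH; None means the current PATH.
    """
    return sorted(
        tool for tool in denylist if shutil.which(tool, path=search_path)
    )
```

Run as a step in the image build, a non-empty result from `unexpected_tools()` could fail the build before the image is ever used for a job.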

Pillar 2: Least Privilege through OIDC and Identity Federation

Even with ephemeral runners, the build process still needs permissions to interact with external services (e.g., pushing a Docker image to ECR, uploading an artifact to S3, or updating a Lambda function).

The legacy method for providing these permissions is storing long-lived secrets (like `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`) in the CI/CD provider's secret store. This is a high-risk practice. If the CI/CD provider is breached, your cloud credentials are leaked, and they stay valid until someone rotates them.
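
A quick audit can flag long-lived keys that are still injected into jobs. This sketch relies on AWS's documented ID prefixes, where `AKIA` marks a long-term access key and `ASIA` marks temporary STS credentials. The function name is hypothetical, and a regex match is only a heuristic, not proof that a key is live.

```python
import re
from typing import List, Mapping

# "AKIA" followed by 16 uppercase letters or digits: a long-term access key ID.
LONG_LIVED_KEY = re.compile(r"\bAKIA[A-Z0-9]{16}\b")


def long_lived_key_vars(env: Mapping[str, str]) -> List[str]:
    """Return the names of variables whose value contains a long-term key ID."""
    return sorted(
        name for name, value in env.items() if LONG_LIVED_KEY.search(value)
    )
```

Calling `long_lived_key_vars(os.environ)` at the start of a job would list the variables that should be replaced by federated, short-lived credentials.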

The Solution: OpenID Connect (OIDC)

Modern CI/CD platforms (GitHub Actions, GitLab CI, CircleCI) support OIDC-based identity federation. This allows the runner to exchange a short-lived, cryptographically signed token (a JWT) from the CI/CD provider for a temporary, scoped security token from your cloud provider (e.g., AWS STS).

How the Workflow Functions

  1. The Request: The runner starts a job and requests an OIDC token from the CI/CD provider. This token contains "claims" about the environment (e.g., `repository: my-org/my-app`, `ref: refs/heads/main`).
  2. The Exchange: The runner presents this JWT to the Cloud Provider's Security Token Service (STS).
  3. The Validation: The Cloud Provider verifies the signature of the JWT against the CI/CD provider's public keys and checks the Trust Policy.
  4. The Grant: If the claims match the predefined policy, the Cloud Provider issues short-lived credentials for an IAM role session to the runner.
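
The claim check in steps 3 and 4 can be modeled as a deny-by-default comparison. The sketch below is illustrative only: the cloud provider performs this evaluation itself, the function names are hypothetical, and the decoder deliberately skips signature verification, so it must never be used to make a real authorization decision.

```python
import base64
import json
from typing import Any, Dict, Mapping


def decode_claims_unverified(jwt: str) -> Dict[str, Any]:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    try:
        _header, payload, _signature = jwt.split(".")
    except ValueError:
        raise ValueError("expected three dot-separated JWT segments") from None
    # base64url segments omit padding; restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def claims_allowed(claims: Mapping[str, Any], required: Mapping[str, Any]) -> bool:
    """Deny by default: every required claim must be present and equal."""
    if not required:
        return False  # an empty policy grants nothing
    return all(claims.get(key) == value for key, value in required.items())
```

With `required = {"repository": "my-org/my-app", "ref": "refs/heads/main"}`, a token issued for a feature branch of the same repository is rejected, because a single mismatched claim is enough to deny.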

Practical Example: Scoped AWS IAM Trust Policy

Instead of allowing any job from your GitHub organization to assume a role, you should restrict the trust relationship to specific repositories and branches.

Example IAM Trust Policy (JSON):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:
```
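
A scoped trust policy for GitHub Actions generally has the shape produced by the sketch below. Treat it as an illustration under stated assumptions, not as the exact policy this article intended: the account ID `123456789012` is a placeholder, the repository and branch reuse the claim examples from earlier, the helper name is hypothetical, and the `aud` value assumes the default audience used when GitHub Actions federates with AWS STS.

```python
import json
from typing import Any, Dict

# Hostname of GitHub's OIDC issuer, as registered in IAM as an identity provider.
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_trust_policy(account_id: str, repo: str, branch: str) -> Dict[str, Any]:
    """Build a trust policy limited to one repository and one branch."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # The "sub" claim pins both the repository and the branch.
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Placeholder account ID; substitute your own values.
print(json.dumps(github_trust_policy("123456789012", "my-org/my-app", "main"), indent=2))
```

Using `StringEquals` on the `sub` claim, rather than a wildcard match, is what keeps jobs from other repositories or branches from assuming the role.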

Conclusion

Securing CI/CD agents depends on execution discipline as much as on design. Ephemeral runners remove the persistence that makes a compromised build so dangerous, and OIDC-based identity federation replaces stored long-lived secrets with short-lived, narrowly scoped credentials.

The practical hardening path has three parts: enforce strict token and claim validation with replay resistance; evaluate identity policy deterministically with deny-by-default semantics; and combine admission-policy enforcement with workload isolation and network policy controls. Together these reduce both exploitability and attacker dwell time, because an attack must get past several independent control layers.

Operational confidence should be measured, not assumed. Track the false-allow rate, the time to revoke privileged access, and the mean time to detect and remediate configuration drift, then use those results to tune preventive policy, detection fidelity, and response runbooks on a fixed review cadence.
