GitOps Chaos Engineering Without the Heavyweight Stack

Building a deterministic, request-level HTTP fault injector with Istio, Envoy, and Kubernetes admission control.

Why I built this (the honest motive)

Chaos engineering is everywhere… but so is chaos tooling sprawl.

In many organizations, “doing chaos” quickly turns into adopting large platforms (Gremlin, Litmus, Chaos Mesh) that come with their own control planes, agents, permissions, UIs, and operational overhead. Those tools are powerful, but if your goal is simple, safe, GitOps-first experiments, they can feel like using a rocket launcher to open a can.

So I built a proof of concept: a minimal GitOps chaos system that lets teams declare request-level HTTP faults as Kubernetes resources, enforced by policy, applied by a controller, and executed by Envoy — without agents, privileged access, or a separate chaos platform.

The core question I wanted to answer was:

Can we do safe, auditable, namespace-scoped chaos with a straightforward approach that fits naturally into GitOps workflows?

This repo is that answer.

What I needed to test

I wasn’t trying to simulate “all chaos.” I was focused on the failures I see most often in real systems:

1) Dependency failures (downstream)

  • What happens if my service’s dependency is slow?
  • What happens if the dependency becomes unavailable?
  • Do retries/backoff/circuit breakers behave as expected?
  • Does the service degrade gracefully, or does it cascade the failure?

2) Service behavior under latency and errors (upstream + downstream)

I wanted to inject:

  • fixed latency delays
  • fixed HTTP status aborts (timeouts / 5xx)

…for specific routes, and for both:

  • INBOUND (client → service)
  • OUTBOUND (service/pod → dependency)

This is crucial because route-level failures are the closest to real incident patterns:

  • a single endpoint regresses
  • a single downstream dependency slows down
  • a specific call path starts timing out

What I didn’t want

The most important constraint: don’t impact other services.

This PoC is built for shared clusters, where uncontrolled chaos becomes a production incident. So I made blast radius and safety first-class, not “best effort.”

The design goal was:

Experiments must be constrained by hard guardrails enforced at admission time, not by “people following a runbook”.

That’s why the system uses Kubernetes ValidatingAdmissionPolicy (CEL) to reject unsafe specs before they can run.

What this is: Declarative HTTP chaos (Istio / Envoy)

At its core, this project introduces one custom resource:

FaultInjection (CRD)

A FaultInjection declares:

  • blast radius (duration, traffic %)
  • actions (HTTP latency, HTTP abort)
  • targeting (routes, headers, direction, destination hosts)
  • scope (namespace)

If it passes admission policy, a controller reconciles it into deterministic Istio VirtualService rules.
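
To make this concrete, here is a minimal sketch of what such a resource could look like. The field names (blastRadius, target, action, and the API group) are illustrative stand-ins for the concepts above, not necessarily the repo's exact schema:

# Illustrative sketch: field names approximate the concepts above,
# not necessarily the exact schema in the repo.
apiVersion: chaos.example.dev/v1alpha1
kind: FaultInjection
metadata:
  name: vendors-latency
  namespace: demo
spec:
  blastRadius:
    duration: 10m          # hard TTL; the controller cleans up on expiry
    trafficPercent: 10     # capped by admission policy
  direction: INBOUND       # client -> service
  target:
    virtualService: vendors
    match:
      uri:
        prefix: /api/vendors
  action:
    type: HTTP_LATENCY
    fixedDelay: 2s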

Why request-level faults?

Because they’re:

  • deterministic
  • reversible
  • narrow in blast radius
  • safe in shared clusters
  • observable via Envoy metrics

This is not packet-level chaos (no tc/netem/iptables). It’s HTTP-level fault injection.

Architecture: policy-first control plane, Envoy runtime

Control plane (safe-by-default)

  • Developer commits a FaultInjection YAML
  • GitOps applies it (ArgoCD or plain kubectl apply)
  • Admission policy validates constraints (duration, percentage, semantics)
  • Controller reconciles → patches VirtualService
  • Controller auto-cleans on expiry

Runtime execution (deterministic)

Once the VirtualService is patched, Envoy sidecars enforce:

  • inject delay for matching requests
  • or abort with a fixed HTTP status (e.g., 504)

Other routes remain untouched.
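
For an inbound latency fault, the generated rule could look roughly like this. The hosts and namespace are placeholders; the fault block itself is standard Istio HTTPFaultInjection, which Envoy enforces per matching request:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vendors
  namespace: demo
spec:
  hosts:
    - vendors.demo.svc.cluster.local
  http:
    - match:
        - uri:
            prefix: /api/vendors
      fault:
        delay:
          fixedDelay: 2s
          percentage:
            value: 10          # only this share of matching requests is delayed
      route:
        - destination:
            host: vendors.demo.svc.cluster.local
    - route:                   # every other route keeps the untouched default
        - destination:
            host: vendors.demo.svc.cluster.local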

What’s implemented as of publishing this post

Supported faults

  • HTTP_LATENCY — fixed delay
  • HTTP_ABORT — deterministic abort (e.g., 504)

Directions

  • INBOUND: client → service (VirtualService ref)
  • OUTBOUND: pod → destination (mesh gateway)

Targeting

  • URI prefix or exact
  • optional headers
  • source pod labels (outbound)
  • destination hosts
  • percentage-based impact

Guardrails (enforced at admission)

Examples of what’s already enforced:

  • required blast radius + duration
  • percent bounds and max traffic cap
  • required route match (prefix/exact)
  • correct semantics per action type and per direction
  • safe routing rules only

If a manifest violates the guardrails, Kubernetes rejects it, meaning:

the experiment never exists → the experiment never runs.

That’s the right failure mode.
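
As a sketch of how that gate can be expressed, a ValidatingAdmissionPolicy with CEL rules might look like this on a recent Kubernetes version. It assumes the hypothetical field names from the earlier FaultInjection example, so treat it as a shape, not the repo's literal policy:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: faultinjection-guardrails
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["chaos.example.dev"]      # hypothetical API group
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["faultinjections"]
  validations:
    - expression: "has(object.spec.blastRadius) && has(object.spec.blastRadius.duration)"
      message: "blast radius and duration are required"
    - expression: "has(object.spec.blastRadius.trafficPercent) && object.spec.blastRadius.trafficPercent <= 25"
      message: "traffic percentage is missing or exceeds the cluster cap"

A ValidatingAdmissionPolicyBinding then scopes where the policy applies, for example to chaos-enabled namespaces only.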

Why this approach is valuable (benefits)

1) GitOps-native by design

Your chaos experiments are just YAML:

  • reviewable in PRs
  • auditable
  • versioned
  • reproducible

You can put them in a service repo like:

service-repo/
  chaos/
    latency.yaml
    timeout.yaml

And treat experiments like any other deployment artifact.
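
If you deploy with Argo CD, pointing an Application at that folder is enough; the repo URL and namespaces below are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: service-chaos
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/service-repo.git   # placeholder
    targetRevision: main
    path: chaos
  destination:
    server: https://kubernetes.default.svc
    namespace: demo
  syncPolicy:
    automated:
      prune: true        # deleting the YAML in Git removes the experiment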

2) Deterministic and route-level

This is not “random network weirdness.” It’s:

  • precise
  • request-level
  • bounded by path + percent + duration

You can answer questions like:

  • “Does /api/vendors degrade gracefully under +2s latency?”
  • “What happens to checkout if payment returns 504 for 10% of calls?”

…without turning the entire cluster into a science experiment.

3) Safe in shared clusters

This is the big one.

Instead of trusting teams to do the right thing manually, the platform can enforce:

  • max duration
  • max traffic percentage
  • direction rules
  • mandatory selectors and route constraints

All enforced before execution.

That reduces:

  • accidental wide blast radius
  • cross-namespace impact
  • long-running experiments
  • “oops I pointed it to the wrong VirtualService”

4) Operationally lightweight

No agents, no privileged DaemonSets, no kernel mutation. Just:

  • a controller
  • a CRD
  • admission policies
  • Istio/Envoy (already running in many clusters)

A concrete example: simulate dependency latency + timeout

You can express a scenario like:

  • OUTBOUND latency to a dependency for requests matching /anything/vendors/
  • OUTBOUND abort when header x-chaos-mode: timeout is present
  • INBOUND versions of the same (to validate how the service behaves for clients)
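
A sketch of the OUTBOUND abort leg, reusing the hypothetical schema from earlier (the destination host and labels are placeholders):

apiVersion: chaos.example.dev/v1alpha1
kind: FaultInjection
metadata:
  name: vendors-dependency-timeout
  namespace: demo
spec:
  blastRadius:
    duration: 15m
    trafficPercent: 100        # applies only within the header-scoped match
  direction: OUTBOUND          # pod -> dependency
  target:
    sourceLabels:
      app: checkout
    destinationHost: httpbin.demo.svc.cluster.local
    match:
      uri:
        prefix: /anything/vendors/
      headers:
        x-chaos-mode:
          exact: timeout
  action:
    type: HTTP_ABORT
    httpStatus: 504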

And then run the included test script from the curl-client pod to validate:

  • control path is fast
  • abort path returns 504
  • delay path takes ~2s
  • non-matching paths remain fast

This matters because it proves:

  • route isolation works
  • guardrails allow controlled chaos
  • the mesh does deterministic injection
  • cleanup restores baseline state

Looking into the architecture

C4 Level 1 — System context [diagram]

C4 Level 2 — Containers [diagram]

Sequence — GitOps apply → policy gate → run → TTL cleanup [diagram]

Sequence — Inbound vs Outbound [diagram]

Blast radius — why other routes don’t break [diagram]

How to take it forward (roadmap)

What exists today covers the “safe HTTP chaos MVP” well. The next step is turning this into a repeatable resilience workflow, not just a fault injector.

Here’s the forward plan, aligned with what I’ve already built, using RFC ideas only for what isn’t implemented yet:

1) Add k6-loadgen as a first-class workflow

Right now, we can run curl-based tests. The next step is:

  • let experiments include a load generation plan
  • run it during the chaos window
  • capture results consistently

Why it matters: Chaos without controlled traffic is hard to interpret. k6 makes experiments measurable and repeatable.

A practical direction:

  • a k6-loadgen Job triggered alongside the FaultInjection
  • scoped to the same namespace
  • runs for the same TTL window
  • stores a summary (p95/p99, error rate) back into status
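
One possible shape for that companion Job, with the image tag and script path as placeholders (nothing like this is wired up yet):

apiVersion: batch/v1
kind: Job
metadata:
  name: vendors-latency-loadgen
  namespace: demo
spec:
  activeDeadlineSeconds: 600     # align with the FaultInjection TTL window
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: k6
          image: grafana/k6:latest
          command: ["k6", "run", "/scripts/vendors.js"]
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: k6-scripts     # holds the k6 test script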

2) Manual abort as a first-class control

This came up while thinking about the problem: GitOps deletion is not always fast enough.

Add a spec control field:

spec:
  control:
    abort: true
    reason: "Unexpected impact"
    requestedBy: "oncall"

Controller behavior:

  • detect abort flag
  • immediately clean up injected rules
  • mark status as Aborted
  • idempotent cleanup
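
The resulting status could look something like this (a sketch with hypothetical fields):

status:
  phase: Aborted
  reason: "Unexpected impact"
  requestedBy: "oncall"
  cleanupComplete: true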

This gives on-call a “stop button” without touching deployments.

3) Stop conditions and auto-abort (optional but powerful)

This is where chaos becomes production-grade:

  • define stop conditions (latency, 5xx rate, burn rate)
  • evaluate periodically
  • abort immediately if breached
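
A possible spec shape for this, purely as a roadmap sketch with invented field names:

spec:
  stopConditions:
    - metric: http_p99_latency_ms
      threshold: 1500
      window: 1m
    - metric: http_5xx_rate
      threshold: 0.05            # abort if more than 5% of requests fail
      window: 1m

Breaching either condition would trigger the same abort path described above.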

I don’t need to implement the full platform stack at once. Even a small set (p99 + 5xx) goes a long way.

4) Expand fault types carefully (only if needed)

The current scope is intentionally narrow and safe. If I expand it, it should follow the same philosophy:

  • bounded pod delete (maxPodsAffected)
  • scaling faults with snapshot+restore
  • AWS dependency faults as request-only objects (no AWS IAM in the controller)

But the key constraint remains: don’t lose the “safe in shared clusters” property.

What the future could look like

Adding k6-loadgen (roadmap) [diagram]

Closing thought: chaos as a capability, not a product

This PoC isn’t trying to compete with Gremlin/Litmus/Chaos Mesh on feature breadth.

It’s taking a different stance:

  • small, deterministic primitives
  • GitOps first
  • policy enforced
  • namespace scoped
  • safe-by-default

If you can do 80% of your resilience testing with:

  • latency
  • aborts
  • route targeting
  • strict guardrails

…then you can shift chaos from a special event into a normal engineering practice.

And that’s the real win.

This post is licensed under CC BY 4.0 by the author.