What is chaos engineering in one sentence?

Chaos engineering is the practice of deliberately injecting controlled failures into a system, such as killing servers or adding network latency, to build confidence that it can withstand real-world turbulence and to expose weaknesses before they cause an outage.

What is Chaos Monkey?

Chaos Monkey is a tool Netflix built that randomly terminates production instances during business hours, forcing engineers to build services that tolerate sudden failure. It launched the broader Simian Army and popularized chaos engineering.

What is blast radius in chaos engineering?

Blast radius is how much of the system a chaos experiment can affect. Good practice starts with a tiny blast radius (one instance, a small slice of traffic) and expands only as confidence grows, so that an experiment that goes wrong stays a small problem instead of becoming a large outage.

Is chaos engineering done in production?

Mature teams run experiments in production because that is the only place real conditions exist, but always with a defined hypothesis, a limited blast radius, and an automatic abort. Many teams start in staging and graduate to production carefully.

What is a game day?

A game day is a scheduled exercise where a team deliberately triggers failures and practices responding, testing both the system's resilience and the team's runbooks, alerts, and incident process in a controlled setting.


What is Chaos Engineering?

Chaos engineering is the practice of breaking your own system on purpose (killing servers, injecting network latency, cutting off dependencies) under controlled conditions, so you can verify that it survives the kinds of failures that happen in the real world and find the weaknesses on your own terms instead of at 3 a.m. during an outage.

The counterintuitive premise

At first it sounds reckless: why would anyone deliberately break a working system? The answer is that distributed systems are going to fail whether you like it or not — networks drop packets, disks die, a dependency times out, a cloud zone goes dark. The only question is whether you discover how your system responds to those failures in a controlled experiment, with engineers watching and a plan to abort, or in an unplanned production incident with customers screaming. Chaos engineering chooses the former. It is the empirical answer to a hope-based assumption that "the failover will just work."

Where it came from: Netflix and Chaos Monkey

The discipline was popularized by Netflix around 2010 as it moved to the cloud. The team built Chaos Monkey, a tool that randomly terminates production instances during business hours. The logic was brilliant: if a server can be killed at any moment, engineers are forced to build services that tolerate it, and the painful work of resilience gets done continuously instead of deferred forever. Chaos Monkey grew into the "Simian Army" — tools that simulated everything from regional outages to latency spikes — and Netflix's published principles turned an internal practice into an industry discipline.
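The selection rule described above (random victim, business hours only) can be sketched in a few lines. This is an illustration of the idea, not Netflix's implementation: the function names, the 9-to-17 weekday window, and the instance IDs are all assumptions made for the example, and nothing here actually terminates anything.

```python
import random
from datetime import datetime
from typing import List, Optional


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The window is an assumption; the point is that engineers are at their
    desks when the failure happens.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances: List[str], now: datetime, rng: random.Random) -> Optional[str]:
    """Return one instance ID to terminate, or None if nothing should be killed."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


# Hypothetical fleet; a real tool would query the cloud provider instead.
fleet = ["i-a", "i-b", "i-c"]
victim = pick_victim(fleet, datetime(2024, 1, 8, 12, 0), random.Random())
```

Passing the clock and the random source in as arguments keeps the rule easy to test: the same function can be checked against a Saturday, a 3 a.m. timestamp, or an empty fleet without touching real infrastructure.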

It is a scientific method, not vandalism

Done properly, chaos engineering is rigorous experimentation, not random destruction. Each experiment follows a clear shape. You define the steady state: a measurable signal of healthy behavior, like orders per second. You form a hypothesis: "if we kill one payment service instance, the steady state will hold because traffic reroutes." You introduce the failure. Then you compare what actually happened to the hypothesis. If the system held, you have earned real confidence; if it did not, you have found a weakness cheaply, in daylight, with the people who can fix it already paying attention.
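The four steps (steady state, hypothesis, failure, comparison) map onto a small harness. The sketch below is a toy under stated assumptions: `ToyPaymentService`, its capacity numbers, and the 5% tolerance are invented for illustration, and the "failure" only decrements a counter in memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float        # steady state before the failure
    observed: float        # steady state while the failure is active
    hypothesis_held: bool  # True if observed stayed within tolerance


def run_experiment(
    measure_steady_state: Callable[[], float],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float = 0.05,
) -> ExperimentResult:
    """Measure, inject the failure, measure again, then compare."""
    baseline = measure_steady_state()
    inject_failure()
    try:
        observed = measure_steady_state()
    finally:
        rollback()  # always restore the system, even if measuring fails
    held = observed >= baseline * (1 - tolerance)
    return ExperimentResult(baseline, observed, held)


class ToyPaymentService:
    """Hypothetical stand-in for a real service: throughput is capped by capacity."""

    def __init__(self, instances: int, capacity_per_instance: float, demand: float):
        self.instances = instances
        self.capacity_per_instance = capacity_per_instance
        self.demand = demand

    def orders_per_second(self) -> float:
        return min(self.demand, self.instances * self.capacity_per_instance)

    def kill_one(self) -> None:
        self.instances -= 1

    def restore_one(self) -> None:
        self.instances += 1


# Hypothesis: with three instances, losing one does not reduce orders per second.
service = ToyPaymentService(instances=3, capacity_per_instance=60.0, demand=100.0)
result = run_experiment(service.orders_per_second, service.kill_one, service.restore_one)
print(result.hypothesis_held)  # True: two instances still cover the demand
```

Run the same experiment against a two-instance service and the hypothesis fails, which is exactly the kind of finding the method is meant to surface before a real outage does.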

Blast radius and the abort button

The discipline that separates chaos engineering from negligence is controlling the blast radius — how much of the system an experiment can affect. You start tiny: one instance, one percent of traffic, one non-critical dependency. You confirm you can observe the impact and automatically halt the experiment the instant the steady state degrades past a threshold. Only as confidence grows do you widen the scope. Running experiments in production is the goal, because that is the only place real conditions exist, but always with a small blast radius and a working stop switch.
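One way to picture "start tiny, widen only as confidence grows" is a loop over progressively larger stages with an abort check at each one. The following is a minimal sketch, not a production controller: the stage fractions, the 95% abort threshold, and the `toy_observe` model are assumptions, and a real system would monitor continuously during each stage instead of checking once.

```python
from typing import Callable, List, Sequence, Tuple


def expand_blast_radius(
    stages: Sequence[float],
    observe: Callable[[float], float],
    baseline: float,
    abort_ratio: float = 0.95,
) -> Tuple[List[float], bool]:
    """Widen the experiment stage by stage; stop as soon as the steady state degrades.

    stages      -- fractions of traffic exposed to the failure, smallest first
    observe     -- returns the steady-state metric with that fraction exposed
    baseline    -- the healthy steady-state value
    abort_ratio -- abort when the metric falls below baseline * abort_ratio
    Returns (stages that passed, whether the run was aborted).
    """
    passed: List[float] = []
    for fraction in stages:
        observed = observe(fraction)
        if observed < baseline * abort_ratio:
            return passed, True  # stop switch: do not widen any further
        passed.append(fraction)
    return passed, False


def toy_observe(fraction: float) -> float:
    """Hypothetical system: absorbs up to 10% affected traffic, then loses the rest."""
    return 100.0 if fraction <= 0.10 else 100.0 * (1 - fraction)


passed, aborted = expand_blast_radius([0.01, 0.05, 0.25, 1.0], toy_observe, baseline=100.0)
print(passed, aborted)  # [0.01, 0.05] True
```

The run stops at the 25% stage and never reaches 100%, so the weakness is found while three quarters of the traffic is still untouched.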

Observability is the prerequisite

You cannot do chaos engineering without strong observability. The entire method depends on measuring the steady state and watching what happens when you inject failure. If you cannot see the impact in real time, you cannot run a safe experiment or learn anything from it. This is why chaos engineering tends to arrive after a team has solid metrics, dashboards, and distributed tracing in place. It also complements load testing: load testing asks "does it survive a crowd?" while chaos engineering asks "does it survive its own pieces breaking?"

At QUANT LAB

We see chaos engineering as a maturity step, not a starting point. For the systems we build and operate through our cloud infrastructure and DevOps engineering work, we first make sure the fundamentals are real (infrastructure as code, redundancy, and observability), then use controlled failure experiments and game days to prove the redundancy actually works rather than just existing on a diagram. There is a security parallel too: deliberately injecting failure to test resilience is, in spirit, the same instinct behind a penetration test. You would rather find the failure yourself than have someone else find it for you.

