Chaos Engineering — Technology Wiki

Overview

Direct Answer

Chaos engineering is a systematic practice of injecting controlled failures and disruptions into production or production-like systems to uncover weaknesses before customers encounter them. This discipline validates that distributed systems can gracefully handle unexpected adverse conditions and recover with minimal service degradation.

How It Works

Practitioners design and execute experiments that deliberately introduce faults—network latency, service outages, resource exhaustion, or data corruption—into running systems whilst monitoring system behaviour and recovery mechanisms. Results from these experiments reveal architectural fragilities, misconfigured resilience patterns, and unvalidated assumptions about component interdependencies.
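The experiment loop described above can be sketched in a few lines of Python. This is an illustrative sketch only, not a specific chaos tool's API: `FaultInjector`, `with_retry`, `success_rate`, and the `fetch_price` dependency are all hypothetical names introduced here. The sketch injects latency and connection errors into a call at a configured rate, then measures how often the call still succeeds, with and without a simple retry as the resilience pattern under test.

```python
import random
import time
from typing import Callable


class FaultInjector:
    """Hypothetical helper: wraps a callable and injects latency and errors."""

    def __init__(self, error_rate: float, latency_s: float, seed: int = 0,
                 sleep: Callable[[float], None] = time.sleep):
        self.error_rate = error_rate      # probability of an injected failure
        self.latency_s = latency_s        # delay added before every call
        self._rng = random.Random(seed)   # seeded so runs are repeatable
        self._sleep = sleep               # replaceable so tests need not wait

    def wrap(self, func):
        def wrapper(*args, **kwargs):
            self._sleep(self.latency_s)
            if self._rng.random() < self.error_rate:
                raise ConnectionError("injected fault")
            return func(*args, **kwargs)
        return wrapper


def fetch_price(item: str) -> float:
    """Stand-in for a real downstream dependency (assumed for illustration)."""
    return 9.99


def with_retry(func, attempts: int = 3):
    """Resilience pattern under test: retry on connection errors."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except ConnectionError:
                if attempt == attempts - 1:
                    raise  # out of attempts, surface the failure
    return wrapper


def success_rate(call, trials: int) -> float:
    """Observed behaviour: the fraction of calls that complete successfully."""
    ok = 0
    for _ in range(trials):
        try:
            call("widget")
            ok += 1
        except ConnectionError:
            pass
    return ok / trials


if __name__ == "__main__":
    bare = FaultInjector(error_rate=0.3, latency_s=0.0, seed=1).wrap(fetch_price)
    guarded = with_retry(
        FaultInjector(error_rate=0.3, latency_s=0.0, seed=1).wrap(fetch_price))
    print("without retry:", success_rate(bare, 1000))
    print("with retry:   ", success_rate(guarded, 1000))
```

Comparing the two success rates is the experiment's result: if the retried path does not clearly outperform the bare path, the resilience pattern is not doing what the team assumed. A real experiment would inject faults at the network or infrastructure layer and read the success rate from production monitoring, not from an in-process counter.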

Why It Matters

Organisations rely on this approach to reduce unplanned downtime costs, build customer trust through demonstrated reliability, and identify systemic risks before they cause widespread outages. It transforms resilience from an aspirational attribute into a measurable, continuously validated engineering property.

Common Applications

E-commerce platforms use controlled failure injection to validate checkout system redundancy; financial services firms test payment network resilience; cloud infrastructure providers simulate regional failures to validate disaster recovery procedures.

Key Considerations

Experiments must be carefully scoped and run under controlled conditions to avoid unintended production harm; teams need explicit blast radius limits and a tested rollback path. Results are specific to a point in time and to the architecture tested, so they must be re-validated as systems evolve.
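One way to encode a blast radius limit and a rollback is shown below. It is a minimal sketch under stated assumptions: `pick_targets`, `injected_fault`, `run_experiment`, and the `read_error_rate` probe are hypothetical names, and "injecting a fault" is modelled simply as adding a host to a `degraded` set. The points it illustrates are that the number of affected hosts is capped up front, that a guardrail metric can abort the run, and that the rollback executes whether the experiment completes or aborts.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Set, Tuple


class AbortExperiment(Exception):
    """Raised when the guardrail metric crosses its abort threshold."""


def pick_targets(hosts: List[str], max_fraction: float) -> List[str]:
    """Cap the blast radius: touch at most max_fraction of hosts (rounded down)."""
    limit = int(len(hosts) * max_fraction)
    return hosts[:limit]


@contextmanager
def injected_fault(targets: List[str], degraded: Set[str]) -> Iterator[None]:
    """Apply the fault, and always undo it on exit (the rollback)."""
    degraded.update(targets)
    try:
        yield
    finally:
        degraded.difference_update(targets)


def run_experiment(hosts: List[str], max_fraction: float,
                   read_error_rate: Callable[[Set[str]], float],
                   abort_threshold: float,
                   checks: int = 5) -> Tuple[str, Set[str]]:
    """Run a guarded experiment; return the outcome and any hosts left degraded."""
    degraded: Set[str] = set()
    targets = pick_targets(hosts, max_fraction)
    try:
        with injected_fault(targets, degraded):
            for _ in range(checks):
                # read_error_rate is an assumed probe, e.g. a monitoring query
                if read_error_rate(degraded) > abort_threshold:
                    raise AbortExperiment("guardrail breached")
        outcome = "completed"
    except AbortExperiment:
        outcome = "aborted"
    return outcome, degraded


if __name__ == "__main__":
    fleet = [f"h{i}" for i in range(10)]
    probe = lambda bad: len(bad) / len(fleet)  # toy metric for the demo
    print(run_experiment(fleet, 0.2, probe, abort_threshold=0.1))
    print(run_experiment(fleet, 0.2, probe, abort_threshold=0.5))
```

Putting the rollback in a `finally` clause is the key design choice: the fault is removed even when the abort path fires, so an experiment that goes wrong cannot leave hosts degraded.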

Related in Site Reliability

Site Reliability Engineering

A discipline applying software engineering principles to infrastructure and operations to create scalable, reliable systems.

Incident Management

The processes and tools for detecting, responding to, resolving, and learning from service disruptions.

Runbook

A documented set of procedures for handling routine operations and troubleshooting common issues.

Capacity Planning

The process of determining the production capacity needed to meet changing demands for an organisation's products.

High Availability

A system design approach that ensures a certain degree of operational continuity during a given measurement period.

More in DevOps & Infrastructure

Monitoring

Observability

The continuous observation of system performance, availability, and health using automated tools and dashboards.

Artifact Repository

CI/CD

A centralised storage system for managing binary artifacts produced during the software build process.

GitOps

Infrastructure as Code

An operational framework using Git repositories as the single source of truth for declarative infrastructure and applications.

Logging

Observability

The practice of recording events, errors, and system activities for debugging, auditing, and analysis.

Mean Time to Recovery

CI/CD

The average time it takes to restore a system to normal operation after a failure or incident.

Observability

The ability to understand a system's internal state from its external outputs, encompassing metrics, logs, and traces.

ChatOps

CI/CD

A collaboration model connecting tools, processes, and automation with team chat platforms for operations management.

Graceful Degradation

CI/CD

A design approach where a system continues to operate with reduced functionality when components fail.