Error Budget — Technology Wiki

Overview

Direct Answer

An error budget is the maximum amount of downtime a service may accumulate within a specified measurement window whilst remaining compliant with its Service Level Objective (SLO). It is the complement of the availability target: a service with a 99.9% SLO has an error budget of 0.1% of the window.

How It Works

The budget is calculated by multiplying the acceptable unavailability (100% minus the SLO) by the total time in the measurement window. For example, a 99.95% SLO over 30 days (43,200 minutes) permits 0.05% of that time as downtime, or approximately 21.6 minutes. Teams track actual downtime against this allocation, enabling informed decisions about when to deploy changes, perform maintenance, or accept operational risk.
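The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `error_budget` and `remaining_budget` are hypothetical, and it assumes a purely time-based availability SLO rather than a request-based one.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime permitted by an availability SLO over a window.

    Assumes a time-based SLO expressed as a percentage, e.g. 99.95.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    # Acceptable unavailability as a fraction of the window.
    unavailability = (100 - slo_percent) / 100
    return window * unavailability


def remaining_budget(
    slo_percent: float, window: timedelta, downtime: timedelta
) -> timedelta:
    """Return the unspent budget; a negative result means the SLO is breached."""
    return error_budget(slo_percent, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)
    budget = error_budget(99.95, window)
    print(f"Budget: {budget.total_seconds() / 60:.1f} minutes")
    left = remaining_budget(99.95, window, timedelta(minutes=10))
    print(f"Remaining after 10 minutes of downtime: {left.total_seconds() / 60:.1f} minutes")
```

Returning a `timedelta` rather than a bare number keeps the unit explicit, so the same function works for weekly, monthly, or quarterly windows.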

Why It Matters

Error budgets align incentives between development velocity and reliability. They discourage excessive risk aversion whilst establishing a clear trade-off: teams can deploy more frequently while budget remains, but must prioritise stability once it is exhausted. This framework reduces subjective disputes about how much downtime is acceptable and ties reliability decisions to business outcomes such as revenue and customer retention.
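The trade-off described above can be expressed as a simple release gate. The sketch below is illustrative only: the `ReleasePolicy` enum and the `release_policy` function are hypothetical names, and the two-state rule (release normally while budget remains, freeze once it is spent) is an assumption. Real policies often add intermediate stages or exceptions for reliability fixes.

```python
from enum import Enum


class ReleasePolicy(Enum):
    NORMAL = "normal"  # budget remains: feature releases may proceed
    FREEZE = "freeze"  # budget exhausted: prioritise stability work


def release_policy(budget_minutes: float, downtime_minutes: float) -> ReleasePolicy:
    """Pick a release policy from the budget and the downtime consumed so far.

    Both arguments are in minutes for the same measurement window.
    """
    remaining = budget_minutes - downtime_minutes
    # Any positive remainder allows normal releases; zero or less freezes them.
    return ReleasePolicy.NORMAL if remaining > 0 else ReleasePolicy.FREEZE


if __name__ == "__main__":
    # 21.6 minutes is the 30-day budget for a 99.95% SLO.
    print(release_policy(21.6, 5.0))
    print(release_policy(21.6, 25.0))
```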

Common Applications

Cloud infrastructure providers use error budgets to manage scheduled maintenance windows. E-commerce platforms divide their budget across feature releases, infrastructure upgrades, and incident recovery. Financial services organisations set stricter budgets for payment processing systems whilst allowing larger budgets for non-critical services.

Key Considerations

A single error budget treats every minute of downtime as equally costly, although customer-facing and backend failures often warrant different treatment. Organisations should set SLOs that reflect what the infrastructure can actually deliver, avoiding targets so strict that the budget is exhausted immediately or so loose that they no longer reflect actual user experience.

Related in Observability

Observability

The ability to understand a system's internal state from its external outputs, encompassing metrics, logs, and traces.

Monitoring

The continuous observation of system performance, availability, and health using automated tools and dashboards.

Logging

The practice of recording events, errors, and system activities for debugging, auditing, and analysis.

Distributed Tracing

A method of tracking requests as they flow through distributed systems to diagnose latency and failure points.

Metrics

Quantitative measurements collected over time to track system performance, health, and business outcomes.

Alerting

Automated notifications triggered when system metrics or conditions exceed predefined thresholds.

Prometheus

An open-source monitoring and alerting toolkit designed for reliability and scalability in cloud-native environments.

Grafana

An open-source analytics and visualisation platform for monitoring metrics from multiple data sources.

More in DevOps & Infrastructure

Site Reliability Engineering

A discipline applying software engineering principles to infrastructure and operations to create scalable, reliable systems.

Service Discovery

The automatic detection of devices and services on a network, enabling dynamic service-to-service communication.

Mean Time Between Failures

The average operating time between successive system failures, used as a measure of reliability.

Mean Time to Recovery

The average time it takes to restore a system to normal operation after a failure or incident.

Build Automation

The process of automating the compilation, testing, and packaging of software applications.

CI/CD Pipeline

An automated workflow that builds, tests, and deploys software changes from development to production.

Graceful Degradation

A design approach where a system continues to operate with reduced functionality when components fail.

Runbook

A documented set of procedures for handling routine operations and troubleshooting common issues.