Overview
Direct Answer
An error budget is the maximum amount of downtime a service may accumulate within a specified period whilst remaining compliant with its Service Level Objective (SLO). It is the complement of the availability target: a service with a 99.9% SLO has an error budget of 0.1% of the measurement window.
How It Works
The budget is calculated by multiplying the acceptable unavailability percentage by the total time in the measurement window. For example, a 99.95% SLO over 30 days permits approximately 21.6 minutes of downtime. Teams track actual downtime against this allocation, enabling informed decisions about when to deploy changes, perform maintenance, or accept operational risk.
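The calculation above can be sketched in Python. This is a minimal illustration rather than a standard API: the function names `error_budget_minutes` and `remaining_budget_minutes` are hypothetical, and the sketch assumes a time-based SLO measured over a fixed window in days.

```python
def error_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Return the allowed downtime, in minutes, for a time-based SLO.

    Hypothetical helper: slo_percent is the availability target
    (for example 99.95) and window_days is the measurement window.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    # The acceptable unavailability is the complement of the SLO.
    unavailability = (100 - slo_percent) / 100
    return unavailability * window_minutes


def remaining_budget_minutes(
    slo_percent: float, window_days: float, downtime_minutes: float
) -> float:
    """Return the budget left after the observed downtime.

    A negative result means the budget has been overspent.
    """
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    # 99.95% over 30 days: 0.0005 * 43,200 minutes
    print(round(error_budget_minutes(99.95, 30), 1))  # 21.6
```

Tracking `remaining_budget_minutes` over the window gives the figure teams compare against when weighing a deployment or maintenance task.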
Why It Matters
Error budgets align incentives between development velocity and reliability. They discourage excessive risk aversion whilst establishing clear trade-offs: teams can deploy more frequently while budget remains, but must prioritise stability once it is exhausted. This framework reduces subjective disputes about how much unreliability is acceptable, and ties release decisions to outcomes such as revenue protection and customer retention.
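The trade-off described above can be expressed as a simple release gate. The sketch below is one possible policy, not a prescribed one: the function name `release_policy`, the three outcome labels, and the 25% caution threshold are all assumptions made for illustration.

```python
def release_policy(
    budget_minutes: float,
    downtime_minutes: float,
    caution_fraction: float = 0.25,  # assumed threshold, not a standard value
) -> str:
    """Suggest a release posture from the remaining error budget.

    Hypothetical policy:
      "freeze"  - budget exhausted, prioritise stability work
      "caution" - less than caution_fraction of the budget remains
      "deploy"  - enough budget remains to keep shipping changes
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    remaining = budget_minutes - downtime_minutes
    if remaining <= 0:
        return "freeze"
    if remaining < caution_fraction * budget_minutes:
        return "caution"
    return "deploy"


if __name__ == "__main__":
    # With a 21.6-minute budget and 5 minutes already spent,
    # most of the budget is still available.
    print(release_policy(21.6, 5.0))  # deploy
```

Agreeing the thresholds in advance is what removes the subjective argument: the decision follows from the measured downtime rather than from whoever is most persuasive on the day.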
Common Applications
Cloud infrastructure providers use error budgets to manage scheduled maintenance windows. E-commerce platforms allocate budget consumption across feature releases, infrastructure upgrades, and incident recovery. Financial services organisations establish stricter budgets for payment processing systems whilst allowing higher error margins for non-critical services.
Key Considerations
A single error budget treats every minute of downtime as equally costly, although customer-facing and backend failures warrant different treatment. Organisations must also set SLOs that are realistic for their infrastructure: a target set too high exhausts the budget almost immediately, whilst one set too low never constrains decisions and bears no relation to actual user experience.