Overview
The average time it takes to restore a system to normal operation after a failure or incident.
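As a rough illustration of this definition, the average can be computed from a list of incidents, each recorded as a failure time and a restoration time. This is a minimal sketch: the function name `mean_time_to_restore`, the tuple layout, and the sample timestamps are all hypothetical, and real incident data would usually come from an incident tracker rather than a hard-coded list.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Return the average (restored - failed) duration, or None if there are no incidents.

    `incidents` is an assumed layout: an iterable of (failed_at, restored_at) datetime pairs.
    """
    incidents = list(incidents)
    if not incidents:
        return None
    # Sum the individual outage durations, starting from a zero timedelta.
    total = sum((restored - failed for failed, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: three outages lasting 30, 90, and 60 minutes.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 15, 30)),
    (datetime(2024, 3, 2, 8, 0), datetime(2024, 3, 2, 9, 0)),
]

print(mean_time_to_restore(incidents))  # 1:00:00
```

Because the result is a plain mean, a single very long outage can dominate it, so it is often worth looking at the individual durations alongside the average.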
More in DevOps & Infrastructure
Puppet
Infrastructure as Code
A configuration management tool that automates the provisioning and management of infrastructure.
Metrics
Observability
Quantitative measurements collected over time to track system performance, health, and business outcomes.
Capacity Planning
Site Reliability
The process of determining the production capacity needed to meet changing demands for an organisation's products.
Container Registry
Containers & Orchestration
A repository for storing, managing, and distributing container images.
Monitoring
Observability
The continuous observation of system performance, availability, and health using automated tools and dashboards.
Blue-Green Infrastructure
CI/CD
Maintaining two identical production environments to enable instant switching between versions.
Horizontal Scaling
CI/CD
Adding more machines or nodes to a system to handle increased load.
Alerting
Observability
Automated notifications triggered when system metrics or conditions exceed predefined thresholds.