Overview
A structured review conducted after an incident to identify root causes and prevent recurrence.
More in DevOps & Infrastructure
Rollback
CI/CD
The process of reverting a system to a previous version or state after a failed deployment or update.
Runbook
Site Reliability
A documented set of procedures for handling routine operations and troubleshooting common issues.
Helm
Containers & Orchestration
A package manager for Kubernetes that simplifies the deployment and management of applications using charts.
Error Budget
Observability
The maximum amount of time a service can be unavailable within a given period based on its SLO.
High Availability
Site Reliability
A system design approach that ensures a certain degree of operational continuity during a given measurement period.
Immutable Infrastructure
Infrastructure as Code
An approach where infrastructure components are never modified after deployment but replaced entirely with updated versions.
Incident Management
Site Reliability
The processes and tools for detecting, responding to, resolving, and learning from service disruptions.
Chef
Infrastructure as Code
A configuration management tool using Ruby-based scripts to automate infrastructure setup and maintenance.
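
The Error Budget entry above describes a quantity that follows directly from an SLO, so a small worked example may help. The sketch below assumes an availability SLO expressed as a fraction (for example 0.999) and a 30-day window; the function name `error_budget_minutes` and its parameters are hypothetical, chosen only for illustration.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    Assumes `slo` is a fraction in (0, 1], e.g. 0.999 for "three nines",
    and that the measurement period is a whole number of days.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = period_days * 24 * 60   # minutes in the period
    return (1.0 - slo) * total_minutes      # share of the period allowed to fail


# Example: a 99.9% SLO over 30 days leaves about 43.2 minutes of downtime.
budget = error_budget_minutes(0.999, period_days=30)
print(f"{budget:.1f} minutes")
```

Expressing the budget in minutes rather than as a percentage tends to make it easier to compare against the duration of individual incidents.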