Overview
Direct Answer
A Service Level Indicator (SLI) is a measurable attribute of service performance, such as latency, error rate, or availability, that quantifies one specific dimension of user experience or system behaviour. SLIs form the foundation for defining and tracking whether services meet their intended reliability targets.
How It Works
SLIs are derived from raw metrics collected by observability tools, such as request response times, failed transactions, and uptime periods, and aggregated over defined windows to produce a single percentage or rate. These measurements are then compared against the targets set in service level objectives (SLOs), enabling teams to detect degradation before it significantly affects end users.
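The aggregation and comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `WindowCounts`, `availability_sli`, and `meets_slo` are hypothetical, the request counts are made-up sample values, and the 99.9% target is an assumed SLO chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts aggregated over one measurement window (hypothetical type)."""
    total: int
    failed: int


def availability_sli(counts: WindowCounts) -> float:
    """Return the percentage of successful requests in the window."""
    if counts.total == 0:
        # Assumption: a window with no traffic counts as fully available.
        # Some teams prefer to skip empty windows instead.
        return 100.0
    return 100.0 * (counts.total - counts.failed) / counts.total


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """Compare the measured SLI against the SLO target."""
    return sli_percent >= slo_percent


# Sample values for illustration only.
window = WindowCounts(total=10_000, failed=12)
sli = availability_sli(window)
print(f"SLI: {sli:.2f}%")                            # SLI: 99.88%
print("SLO met:", meets_slo(sli, slo_percent=99.9))  # SLO met: False
```

In this sketch, 12 failures in 10,000 requests give an SLI of 99.88%, which falls just short of the assumed 99.9% target even though the service would look healthy at a glance.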
Why It Matters
Organisations rely on SLIs to establish accountability for reliability, justify infrastructure investment, and balance feature velocity against stability concerns. For customer-facing services, poor SLI performance is often associated with customer churn and revenue loss; for internal systems, it affects productivity and operational costs.
Common Applications
SLIs are used in cloud platforms to monitor uptime and latency, in payment systems to track transaction success rates, and in content delivery networks to measure page load times. DevOps teams use SLIs to trigger automated scaling decisions, whilst incident response teams employ them to prioritise and escalate outages.
Key Considerations
Selecting meaningful SLIs requires understanding actual user impact rather than operational convenience: measuring CPU utilisation, for example, does not necessarily reflect user satisfaction. The targets set against SLIs (the SLOs) must be ambitious enough to drive quality improvements yet achievable enough to maintain team morale and prevent alert fatigue.
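As an illustration of a user-centred indicator, the sketch below computes a latency SLI as the share of requests completed within a threshold, which tracks what users experience more directly than a resource metric such as CPU utilisation. The function name `latency_sli`, the 300 ms threshold, and the sample durations are all assumptions made for the example.

```python
def latency_sli(durations_ms: list[float], threshold_ms: float) -> float:
    """Return the percentage of requests completed within threshold_ms."""
    if not durations_ms:
        raise ValueError("no requests in the measurement window")
    # Count the requests that were fast enough from the user's point of view.
    fast = sum(1 for d in durations_ms if d <= threshold_ms)
    return 100.0 * fast / len(durations_ms)


# Made-up request durations in milliseconds, for illustration only.
samples = [120.0, 95.0, 310.0, 88.0, 140.0, 760.0, 101.0, 99.0]
print(latency_sli(samples, threshold_ms=300.0))  # 75.0
```

Six of the eight sample requests finish within 300 ms, so the indicator reads 75%, whatever the servers' CPU load happened to be at the time.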
More in DevOps & Infrastructure
Runbook
Site Reliability: A documented set of procedures for handling routine operations and troubleshooting common issues.
Capacity Planning
Site Reliability: The process of determining the production capacity needed to meet changing demands for an organisation's products.
Monitoring
Observability: The continuous observation of system performance, availability, and health using automated tools and dashboards.
Logging
Observability: The practice of recording events, errors, and system activities for debugging, auditing, and analysis.
Graceful Degradation
CI/CD: A design approach where a system continues to operate with reduced functionality when components fail.
Vertical Scaling
CI/CD: Increasing the resources (CPU, RAM, storage) of an existing machine to handle more load.
Container Registry
Containers & Orchestration: A repository for storing, managing, and distributing container images.
High Availability
Site Reliability: A system design approach that ensures a certain degree of operational continuity during a given measurement period.