High Availability — Technology Wiki

Overview

Direct Answer

High availability is a system design methodology that minimises unplanned downtime and ensures continuous service operation by eliminating single points of failure. It targets measurable uptime thresholds, commonly expressed as percentage availability (e.g., 99.9% uptime), through redundancy and automated failover mechanisms.

How It Works

High availability architectures employ multiple independent system instances, load balancers, and health-check monitoring to detect failures and automatically redirect traffic to functional components. When a primary server or service fails, health checks detect the fault, typically within seconds depending on the check interval and failure threshold, and requests are routed to standby nodes without manual intervention. This maintains service continuity across infrastructure, database, and application layers.
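The detect-and-redirect step described above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, its `is_healthy` probe, and the `pick_target` function are hypothetical names invented here, and a real load balancer would also apply probe timeouts, retry thresholds, and recovery handling.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Node:
    """A hypothetical backend with a health probe supplied by the caller."""
    name: str
    is_healthy: Callable[[], bool]


def pick_target(nodes: Sequence[Node]) -> Optional[Node]:
    """Return the first node whose health check passes, in priority order."""
    for node in nodes:
        try:
            if node.is_healthy():
                return node
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            continue
    return None


# Simulate a failed primary and a working standby.
primary = Node("primary", lambda: False)
standby = Node("standby", lambda: True)

target = pick_target([primary, standby])
print(target.name if target else "no healthy node")
```

Listing the primary first gives active-passive behaviour: the standby receives traffic only while the primary's probe fails.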

Why It Matters

Organisations depend on continuous service availability to avoid revenue loss, reputational damage, and regulatory penalties. Industries such as financial services, healthcare, and e-commerce require availability guarantees measured in nines (99.99% allows at most about 52.6 minutes of downtime per year), making this design approach essential for service-level agreement compliance and customer trust.
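The arithmetic behind the nines is simple enough to check directly. The sketch below assumes a 365-day year and counts all downtime against the target; the function name `max_downtime_minutes` is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability_percent: float) -> float:
    """Maximum yearly downtime, in minutes, allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {max_downtime_minutes(target):.1f} minutes per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, from roughly 8.8 hours at 99.9% to about 5.3 minutes at 99.999%.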

Common Applications

Web applications use active-passive database replication and clustering; cloud platforms implement multi-region failover; telecommunications networks employ redundant switching systems. Financial transaction systems, streaming services, and critical infrastructure monitoring all require high availability infrastructure to ensure operations continue during component failures.

Key Considerations

Achieving higher availability levels increases complexity, cost, and operational overhead significantly; distributed systems introduce consistency challenges and potential data synchronisation issues. Practitioners must balance availability targets against budget constraints and analyse actual failure modes rather than pursuing maximum availability indiscriminately.
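A rough way to reason about the trade-off is the standard formula for redundant components in parallel. The sketch below assumes that replicas fail independently and that any single healthy replica can serve traffic. Real systems often violate the independence assumption through shared power, networks, or faulty deployments, which is one reason to analyse actual failure modes. The function name `combined_availability` is illustrative.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of N independent replicas where any one is sufficient.

    `single` is the availability of one replica as a fraction (0.99 = 99%).
    """
    if replicas < 1:
        raise ValueError("at least one replica is required")
    if not 0 <= single <= 1:
        raise ValueError("availability must be a fraction between 0 and 1")
    # The system is down only when every replica is down at the same time.
    return 1 - (1 - single) ** replicas


for n in (1, 2, 3):
    print(f"{n} replica(s): {combined_availability(0.99, n):.6f}")
```

Under these idealised assumptions, a second 99% replica lifts the combined figure to 99.99% and a third to 99.9999%, while each replica adds cost and synchronisation work.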

Referenced By

2 terms mention High Availability

Other entries in the wiki whose definition references High Availability — useful for understanding how this concept connects across DevOps & Infrastructure and adjacent domains.

Availability Zone · Cloud Computing

Cloud-Native Database · Cloud Computing

Related in Site Reliability

Site Reliability Engineering

A discipline applying software engineering principles to infrastructure and operations to create scalable, reliable systems.

Chaos Engineering

The discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions.

Incident Management

The processes and tools for detecting, responding to, resolving, and learning from service disruptions.

Runbook

A documented set of procedures for handling routine operations and troubleshooting common issues.

Capacity Planning

The process of determining the production capacity needed to meet changing demands for an organisation's products.

More in DevOps & Infrastructure

Observability

The ability to understand a system's internal state from its external outputs, encompassing metrics, logs, and traces.

Secret Management

CI/CD

The practice of securely storing, accessing, and managing sensitive credentials, API keys, and certificates.

Error Budget

Observability

The maximum amount of time a service can be unavailable within a given period based on its SLO.

Ansible

Infrastructure as Code

An open-source automation tool for configuration management, application deployment, and task automation.

Logging

Observability

The practice of recording events, errors, and system activities for debugging, auditing, and analysis.

Metrics

Observability

Quantitative measurements collected over time to track system performance, health, and business outcomes.

Vertical Scaling

CI/CD

Increasing the resources (CPU, RAM, storage) of an existing machine to handle more load.

Rollback

CI/CD

The process of reverting a system to a previous version or state after a failed deployment or update.