Monitoring & DevOps Glossary

Master the essential terminology of uptime monitoring, reliability engineering, and incident management. From MTTR to SLA, understand the metrics that matter for your infrastructure.

Effective uptime monitoring and incident management require a shared vocabulary. These terms, from Mean Time to Recovery (MTTR) to Service Level Agreements (SLAs), are used by engineering teams, DevOps professionals, and business stakeholders to communicate about reliability, plan infrastructure, and measure success.

Reliability Metrics

MTTR

Mean Time to Recovery - average time required to restore a system after a failure

MTBF

Mean Time Between Failures - average time between system failures
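
As a rough illustration of how MTTR and MTBF relate, the sketch below computes both from a hypothetical list of outage windows over a 30-day observation period. The data and function names are assumptions made for this example, and MTBF is taken here as operating time divided by the number of failures, which is one common convention among several.

```python
from datetime import datetime, timedelta

def total_downtime(outages):
    """Sum the duration of every (start, end) outage window."""
    return sum((end - start for start, end in outages), timedelta())

def mttr(outages):
    """Mean Time to Recovery: average outage duration."""
    return total_downtime(outages) / len(outages)

def mtbf(outages, period):
    """Mean Time Between Failures: operating time per failure."""
    return (period - total_downtime(outages)) / len(outages)

# Hypothetical outage windows (assumed data, not real measurements).
outages = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),  # 30 minutes
    (datetime(2024, 1, 17, 2, 0), datetime(2024, 1, 17, 3, 30)),  # 90 minutes
]
period = timedelta(days=30)

print(mttr(outages))          # 1:00:00
print(mtbf(outages, period))  # 14 days, 23:00:00
```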

SLA

Service Level Agreement - contractual commitment to an uptime percentage and response times

RTO

Recovery Time Objective - maximum acceptable downtime during an incident

RPO

Recovery Point Objective - maximum acceptable data loss measured in time

MTTI

Mean Time to Identify - average time from incident start to root cause identification

Error Budget

Allowed downtime derived from a Service Level Objective (SLO), the internal reliability target - balances reliability with feature velocity
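
The arithmetic behind an error budget can be sketched as follows, assuming a time-based availability target measured over a 30-day window. The function name and the 30-day default are assumptions for illustration; teams may instead budget on failed requests or use a different window.

```python
def error_budget_minutes(slo_percent, period_days=30):
    """Minutes of downtime allowed by an availability target over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

# A 99.9% target over 30 days leaves roughly 43 minutes of allowed downtime.
print(round(error_budget_minutes(99.9), 1))   # 43.2
print(round(error_budget_minutes(99.99), 2))  # 4.32
```

Once the downtime in a period exceeds this figure, the budget is spent, which is typically the signal to prioritize reliability work over new releases.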

Monitoring Concepts

Uptime Monitoring

Continuous checking of service availability from multiple locations

Health Check

Automated verification that a system is running and responsive

Heartbeat

Regular signal sent to verify a system is alive and functioning
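
On the receiving side, heartbeat monitoring usually comes down to checking whether the last signal arrived within the expected interval. A minimal sketch, assuming timestamps in seconds are passed in by the caller; the class name and the grace period are hypothetical choices for this example:

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat and reports whether the sender looks alive."""

    def __init__(self, interval_s, grace_s=0.0):
        # A heartbeat is considered missed once interval + grace has elapsed.
        self.deadline_s = interval_s + grace_s
        self.last_seen = None

    def beat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_seen = now

    def is_alive(self, now):
        """True if a heartbeat has been seen within the allowed window."""
        if self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.deadline_s

monitor = HeartbeatMonitor(interval_s=60, grace_s=10)
monitor.beat(now=0)
print(monitor.is_alive(now=65))  # True: within 60 s interval + 10 s grace
print(monitor.is_alive(now=71))  # False: heartbeat is overdue
```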

Synthetic Monitoring

Simulating user interactions to proactively detect issues

Real User Monitoring

Capturing actual user interactions to measure real-world performance

Observability

Ability to understand internal system state from external outputs (logs, metrics, traces)

HTTP Status Codes

Three-digit server response codes indicating request outcome (200 OK, 404 Not Found, 500 Internal Server Error)

Incident Management

Incident Management

Process for detecting, responding to, and resolving service outages

MTTA

Mean Time to Acknowledge - average time from an alert firing to a responder acknowledging it

Status Page

Public-facing page showing service health and incident history

Incident Severity

Classification level (for example, P1-P4) indicating the impact and urgency of an issue

Post-Incident Review

Analysis after outage to identify root cause and prevent recurrence

Performance & Reliability

Latency

Time delay between request and response, measured in milliseconds
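
Latency is often summarized with percentiles rather than an average, because a few slow requests can distort the mean. The sketch below uses the nearest-rank percentile method on made-up sample values; both the data and the function name are assumptions for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Hypothetical response times in milliseconds, including two slow outliers.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 18, 900]

print(sum(latencies_ms) / len(latencies_ms))  # 126.1 (mean, skewed by outliers)
print(percentile(latencies_ms, 50))           # 14
print(percentile(latencies_ms, 95))           # 900
```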

Throughput

Amount of data successfully processed per unit time

Redundancy

Duplicate systems or components to maintain service during failures

Failover

Automatic switching to backup system when primary fails

Load Balancing

Distributing incoming traffic across multiple servers
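
Round-robin is one of the simplest distribution strategies: each new request goes to the next server in a fixed rotation. A minimal sketch with hypothetical server names; production load balancers typically also account for server health and current load.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hands out servers in a fixed, repeating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._pool = cycle(list(servers))

    def next_server(self):
        """Return the server that should receive the next request."""
        return next(self._pool)

balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([balancer.next_server() for _ in range(4)])
# ['app-1', 'app-2', 'app-3', 'app-1']
```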

Circuit Breaker

Design pattern that prevents cascading failures by failing fast when a service is unhealthy
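
The pattern can be sketched as a small wrapper around calls to a dependency. The version below is a simplified illustration under stated assumptions: the class name, the threshold of consecutive failures, and the reset timeout are all hypothetical, and real libraries usually add thread safety and finer-grained error handling.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock      # injectable so the timeout can be simulated
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast without touching the unhealthy service.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each outbound request in `breaker.call(...)` means that once the dependency has failed repeatedly, callers get an immediate `CircuitOpenError` instead of waiting on timeouts, which is what limits the cascade.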

Deployment & Operations

Blue-Green Deployment

Running two identical environments and switching traffic for zero-downtime releases

Canary Release

Gradually rolling out changes to a small subset of users before full deployment
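
One common way to pick the canary subset is to hash a stable user identifier into a bucket, so the same user consistently sees the same version while the rollout percentage is raised. A sketch under that assumption; the function name and the SHA-256 bucketing scheme are illustrative choices, not a prescribed method.

```python
import hashlib

def in_canary(user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the canary rollout percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the user to a stable bucket in the range 0..9999.
    bucket = int.from_bytes(digest[:4], "big") % 10000
    return bucket < percent * 100

# The same user always gets the same answer for a given percentage.
print(in_canary("user-42", 5) == in_canary("user-42", 5))  # True
print(in_canary("user-42", 0))    # False: nobody is in a 0% rollout
print(in_canary("user-42", 100))  # True: everybody is in a 100% rollout
```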

Chaos Engineering

Discipline of experimenting on distributed systems to build confidence in resilience

Start Monitoring Today

Understanding these metrics is the first step. Use AtomPing to track them across your infrastructure with multi-region monitoring, instant alerts, and public status pages.
