Effective uptime monitoring and incident management require a shared vocabulary. These terms -- from Mean Time to Recovery (MTTR) to Service Level Agreement (SLA) -- are used by engineering teams, DevOps professionals, and business stakeholders to communicate about reliability, plan infrastructure, and measure success.
Reliability Metrics
MTTR
Mean Time to Recovery - average time required to restore a system after a failure
MTBF
Mean Time Between Failures - average time between system failures
SLA
Service Level Agreement - commitment to uptime percentage and response times
RTO
Recovery Time Objective - maximum acceptable downtime during an incident
RPO
Recovery Point Objective - maximum acceptable data loss measured in time
MTTI
Mean Time to Identify - average time from incident start to root cause identification
Error Budget
Allowed downtime derived from the Service Level Objective (SLO) - balances reliability with feature velocity
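The metrics above are related by simple arithmetic. The following is a minimal Python sketch, not a definitive implementation. It assumes incidents are recorded as repair durations, and it estimates availability as MTBF / (MTBF + MTTR), a common convention that this glossary does not itself define. All function names and sample figures are illustrative.

```python
from datetime import timedelta


def mttr(repair_durations):
    """Mean Time to Recovery: average of the individual repair durations."""
    return sum(repair_durations, timedelta()) / len(repair_durations)


def mtbf(total_uptime, failure_count):
    """Mean Time Between Failures: total operating time divided by failures."""
    return total_uptime / failure_count


def availability(mtbf_value, mttr_value):
    """Steady-state availability estimate (assumed formula, see lead-in)."""
    return mtbf_value / (mtbf_value + mttr_value)


def error_budget(slo, window):
    """Downtime allowed in a window for a given SLO, e.g. slo=0.999."""
    return window * (1 - slo)


# Illustrative figures: three outages in a period with 297 hours of uptime.
repairs = [timedelta(minutes=30), timedelta(minutes=60), timedelta(minutes=90)]
recovery = mttr(repairs)
between_failures = mtbf(timedelta(hours=297), failure_count=len(repairs))
uptime_ratio = availability(between_failures, recovery)
budget = error_budget(0.999, timedelta(days=30))
```

With a 99.9% SLO over a 30-day window, the budget works out to roughly 43 minutes of downtime. Once that is spent, a team following this model would typically favour reliability work over new releases.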
Monitoring Concepts
Uptime Monitoring
Continuous checking of service availability from multiple locations
Health Check
Automated verification that a system is running and responsive
Heartbeat
Regular signal sent to verify a system is alive and functioning
Synthetic Monitoring
Simulating user interactions to proactively detect issues
Real User Monitoring
Capturing actual user interactions to measure real-world performance
Observability
Ability to understand internal system state from external outputs (logs, metrics, traces)
HTTP Status Codes
Three-digit server response codes indicating request outcome (200 OK, 404 Not Found, 500 Internal Server Error)
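As a rough illustration of how a health check and a heartbeat might be evaluated, here is a small Python sketch using only the standard library. The rule that 2xx and 3xx responses count as "up" is an assumption (many monitors let you configure the expected codes), and the function names, timeout, and grace period are hypothetical.

```python
import urllib.request
import urllib.error


def is_healthy_status(status_code):
    """Assumed rule: 2xx and 3xx count as 'up', 4xx and 5xx as 'down'."""
    return 200 <= status_code < 400


def check_url(url, timeout=5.0):
    """Perform one HTTP health check and return True if the service looks up."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy_status(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is still available.
        return is_healthy_status(err.code)
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return False


def heartbeat_is_stale(last_seen, now, interval_seconds, grace_seconds=0.0):
    """A heartbeat is missed when no signal arrived within interval + grace."""
    return (now - last_seen) > interval_seconds + grace_seconds
```

A monitor would typically call something like `check_url` on a schedule from several locations, and flag a heartbeat job as down when `heartbeat_is_stale` returns True.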
Incident Management
Incident Management
Process for detecting, responding to, and resolving service outages
MTTA
Mean Time to Acknowledge - average time from an alert firing to a responder acknowledging it
Status Page
Public-facing page showing service health and incident history
Incident Severity
Classification level (P1-P4) indicating impact and urgency of issues
Post-Incident Review
Analysis after an outage to identify root cause and prevent recurrence
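MTTA can be computed from alert and acknowledgement timestamps. The sketch below assumes each incident is stored as an (alerted_at, acknowledged_at) pair; that record shape and the function name are illustrative, not taken from any particular incident tool.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(incidents):
    """Average delay between an alert firing and its acknowledgement.

    incidents: list of (alerted_at, acknowledged_at) datetime pairs.
    """
    delays = [acknowledged - alerted for alerted, acknowledged in incidents]
    return sum(delays, timedelta()) / len(delays)


# Illustrative data: two alerts acknowledged after 4 and 6 minutes.
sample_incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 4)),
    (datetime(2024, 1, 2, 14, 30), datetime(2024, 1, 2, 14, 36)),
]
mtta = mean_time_to_acknowledge(sample_incidents)
```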
Performance & Reliability
Latency
Time delay between request and response, measured in milliseconds
Throughput
Amount of work (requests or data) successfully processed per unit of time
Redundancy
Duplicate systems or components to maintain service during failures
Failover
Automatic switching to backup system when primary fails
Load Balancing
Distributing incoming traffic across multiple servers
Circuit Breaker
Design pattern that prevents cascading failures by failing fast when a service is unhealthy
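The circuit breaker pattern can be sketched in a few lines of Python. This is a simplified, single-threaded illustration: the class name, the failure threshold, and the reset timeout are assumptions, and production libraries usually add thread safety, metrics, and finer-grained half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of calling the unhealthy service.
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: allow one trial call (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or reopen if the trial call failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each request to the downstream service in `breaker.call(...)` and treat `CircuitOpenError` as a signal to use a fallback or return an error immediately.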
Deployment & Operations
Blue-Green Deployment
Running two identical environments and switching traffic for zero-downtime releases
Canary Release
Gradually rolling out changes to a small subset of users before full deployment
Chaos Engineering
Discipline of experimenting on distributed systems to build confidence in resilience
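One common way to pick the "small subset of users" for a canary release is to hash a stable user identifier into a bucket. The Python sketch below assumes string user IDs and a percentage-based rollout; the function name and bucketing scheme are illustrative, not a specific tool's behaviour.

```python
import hashlib


def in_canary(user_id, rollout_percent):
    """Return True if this user should be routed to the canary version.

    The user ID is hashed into a bucket from 0 to 99, so the same user
    always gets the same answer for a given rollout percentage.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Because buckets are fixed per user, raising `rollout_percent` from 5 to 25 keeps the original canary users in the canary group and only adds new ones, which makes a gradual rollout predictable.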