What is Failover?
Failover is the automatic or manual switching to a backup system when a primary system fails. It is the mechanism that turns redundancy into actual availability, ensuring your service stays online even when individual components break.
Definition
Failover is the process of automatically or manually switching operations from a failed primary component to a standby or backup component. The objective is to maintain service continuity with minimal disruption to users.
For example, if your primary database server crashes, a failover mechanism promotes the replica to primary and redirects all database connections to it. Users may experience a brief pause, from seconds to minutes depending on the failover type, but the service continues operating.
How Failover Works
A failover system consists of three core components working together:
1. Health Monitoring (Detection)
A monitoring system continuously checks the health of the primary component. This can be heartbeat signals, health check endpoints, or external monitoring probes. When the primary fails to respond within a configured timeout, it is declared unhealthy. AtomPing performs this function by monitoring your endpoints from multiple regions, detecting failures within seconds.
2. Decision Logic (Determination)
The failover controller evaluates whether the failure is real (not a transient glitch) and whether to initiate a switch. This often requires multiple consecutive failed checks to prevent false positives. Advanced systems use quorum voting: if 2 of 3 monitoring nodes agree the primary is down, failover proceeds.
3. Switchover Execution (Action)
The backup system is activated and traffic is redirected to it. This may involve promoting a database replica, updating DNS records, changing load balancer backends, or activating a standby application server. The switchover must be reliable and fast, as this is when downtime actually occurs.
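The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production controller: the `Probe` class, the threshold of three consecutive failures, and the `promote_standby` callback are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Probe:
    """One monitoring node; tracks consecutive failed checks of the primary."""
    check: Callable[[], bool]       # returns True when the primary responds
    failures_needed: int = 3        # assumed threshold, tune per service
    consecutive_failures: int = 0

    def observe(self) -> bool:
        """Run one check; return True if this node votes 'primary is down'."""
        if self.check():
            self.consecutive_failures = 0   # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_needed


def should_fail_over(votes: List[bool]) -> bool:
    """Quorum: a strict majority of nodes must agree the primary is down."""
    return sum(votes) > len(votes) // 2


def run_round(probes: List[Probe], promote_standby: Callable[[], None]) -> bool:
    """One monitoring round: detect, decide, and act if quorum is reached."""
    votes = [probe.observe() for probe in probes]
    if should_fail_over(votes):
        promote_standby()   # hypothetical action: promote replica, repoint traffic
        return True
    return False
```

With three probes, failover proceeds only once at least two of them have each seen three failed checks in a row, so a single flaky monitoring node or a one-off timeout does not trigger a switch.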
Types of Failover
Failover strategies vary in speed, cost, and complexity:
| Type | Standby State | Failover Time | Data Loss Risk | Cost |
|---|---|---|---|---|
| Hot | Running, synchronized, serving traffic | Seconds | None (synchronous) | Highest |
| Warm | Running, synchronized, not serving | 30s - 5 minutes | Minimal (async replication) | Medium |
| Cold | Powered off, restored from backup | 10 - 60+ minutes | Moderate (last backup) | Lowest |
| DNS | Running independently | 1 - 5 minutes (TTL dependent) | Depends on setup | Low |
Automatic vs Manual: Automatic failover happens without human intervention, minimizing mean time to recovery (MTTR). Manual failover requires an operator to trigger the switch, which is slower but reduces the risk of false-positive failovers. Many organizations use automatic failover for well-understood failure modes and manual failover for complex scenarios.
Failover at Different Layers
Different infrastructure layers use different failover mechanisms:
DNS Failover
DNS health checks monitor your primary server. When it fails, DNS records are updated to point to a secondary IP. Simple to implement and works across any infrastructure. Limited by the DNS TTL: resolvers and clients keep using the cached record for the failed server until it expires. Use AtomPing's DNS lookup tool to verify your DNS configuration.
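To see why TTL matters, it can help to estimate the worst case. The function below is a rough sketch under simple assumptions: detection takes a fixed number of failed checks at a fixed interval, and resolvers honor the TTL exactly (some do not). The function and parameter names are illustrative only.

```python
def worst_case_dns_failover_seconds(
    check_interval: float, failures_needed: int, ttl: float
) -> float:
    """Rough upper bound on how long a client may keep hitting the failed IP.

    Detection takes up to check_interval * failures_needed seconds. After the
    record is updated, a resolver that cached the old answer just before the
    change keeps serving it for up to ttl seconds.
    """
    return check_interval * failures_needed + ttl


# 30-second checks, 3 failures required, 60-second TTL
print(worst_case_dns_failover_seconds(30, 3, 60))  # 150
```

Lowering the TTL shortens this window at the cost of more DNS queries.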
Application Server Failover
Load balancers continuously health-check backend servers. When a server fails its health check, the load balancer stops routing traffic to it. Remaining healthy servers absorb the load. This is the most common application-layer failover mechanism.
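As an illustration, the routing side of this can be modeled as round-robin selection that skips backends marked unhealthy. The `HealthAwareBalancer` class is a hypothetical sketch; real load balancers add connection draining, weights, and gradual recovery.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class HealthAwareBalancer:
    """Round-robin over backends, skipping any that failed a health check."""

    def __init__(self, backends: List[str]) -> None:
        self.healthy: Dict[str, bool] = {b: True for b in backends}
        self._ring: Iterator[str] = cycle(backends)
        self._size = len(backends)

    def mark(self, backend: str, is_healthy: bool) -> None:
        """Record the latest health check result for a backend."""
        self.healthy[backend] = is_healthy

    def next_backend(self) -> str:
        """Return the next healthy backend, trying each one at most once."""
        for _ in range(self._size):
            candidate = next(self._ring)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = HealthAwareBalancer(["app-1", "app-2", "app-3"])
lb.mark("app-2", False)                       # app-2 failed its health check
print([lb.next_backend() for _ in range(4)])  # ['app-1', 'app-3', 'app-1', 'app-3']
```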
Database Failover
Database failover promotes a read replica to primary when the primary fails. Tools such as pg_auto_failover and Patroni (for PostgreSQL) or Group Replication and Orchestrator (for MySQL) automate this. Database failover is often the most critical and complex failover scenario because of data consistency requirements.
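One part of the consistency problem is choosing which replica to promote. A simplified sketch, assuming each replica reports a numeric log position (higher means more data applied), might look like the following. The `Replica` type and its fields are hypothetical and do not reflect how any of the tools above are implemented; real tools also handle fencing the old primary and coordinating the decision across nodes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    name: str
    reachable: bool
    replayed_position: int   # hypothetical: log position applied so far


def pick_promotion_candidate(replicas: List[Replica]) -> Optional[Replica]:
    """Pick the reachable replica that has applied the most data."""
    candidates = [r for r in replicas if r.reachable]
    if not candidates:
        return None          # nothing safe to promote; escalate to a human
    return max(candidates, key=lambda r: r.replayed_position)
```

Promoting the most up-to-date reachable replica keeps the amount of lost data as small as the replication setup allows.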
Region-Level Failover
When an entire region or data center fails, traffic is redirected to another region. This requires geographic redundancy, cross-region data replication, and global load balancing. It provides the strongest protection but is the most complex to implement correctly.
Testing Failover Procedures
Untested failover is unreliable failover. Regular testing is essential to ensure your failover works when you need it:
- Scheduled failover drills: Regularly shut down primary components during low-traffic windows and verify the standby takes over correctly. Document the procedure and results.
- Chaos engineering: Randomly introduce failures in production to validate failover under real conditions. Start with non-critical services and expand as confidence grows.
- Monitor during tests: Use multi-region monitoring to verify that services remain available during failover exercises and measure actual failover duration.
- Test failback too: After failing over, test returning to the primary. Failback failures are a common source of extended outages.
- Validate data integrity: After failover, verify that no data was lost or corrupted. Check replication lag and data consistency between systems.
Warning: Never assume failover works because it was configured correctly. Configuration drift, software updates, network changes, and expired certificates can silently break failover mechanisms. The only proof is a successful test.
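To turn a drill into a number, you can compute the observed outage from the probe results collected during the test. This is a minimal sketch that assumes samples are ordered `(timestamp, succeeded)` pairs; the names are illustrative, and the result is only as precise as the probe interval.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, bool]   # (timestamp in seconds, check succeeded)


def longest_outage_seconds(samples: List[Sample]) -> float:
    """Longest span from a first failed check to the next successful one.

    An outage still in progress at the last sample is measured up to that
    sample's timestamp.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp            # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None                 # service recovered
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


drill = [(0, True), (10, True), (20, False), (30, False), (40, True), (50, True)]
print(longest_outage_seconds(drill))  # 20.0
```

Comparing this figure across drills shows whether failover time is drifting away from your target.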
Monitoring Failover Readiness
Continuous monitoring ensures your failover systems are healthy and ready to activate at any moment:
Monitor Standby Health
Your standby systems need their own health checks. A standby database that silently stopped replicating or a warm server with a full disk is not going to save you during failover. Monitor each standby component's health independently.
Track Replication Lag
For database failover, replication lag determines how much data you might lose. Alert when lag exceeds your RPO (Recovery Point Objective). If your RPO is 30 seconds but replication lag is 5 minutes, you are not meeting your recovery objectives.
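The comparison itself is simple enough to express directly. The sketch below assumes you already have a lag measurement in seconds from your database or monitoring system; `LagStatus` and `evaluate_lag` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagStatus:
    breached: bool          # True when lag exceeds the RPO
    excess_seconds: float   # how far past the RPO the replica currently is


def evaluate_lag(lag_seconds: float, rpo_seconds: float) -> LagStatus:
    """Compare measured replication lag against the recovery point objective."""
    excess = max(0.0, lag_seconds - rpo_seconds)
    return LagStatus(breached=excess > 0, excess_seconds=excess)


# RPO of 30 seconds, measured lag of 5 minutes
print(evaluate_lag(300, 30))  # LagStatus(breached=True, excess_seconds=270.0)
```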
Verify SSL Certificates on Standby
Standby servers need valid SSL certificates too. An expired certificate on your backup will cause connection failures the moment you fail over. Use AtomPing's TLS expiry monitoring to track certificates on both primary and standby infrastructure.
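If you want to script a basic check yourself, Python's standard `ssl` module can read a server certificate's expiry date. The sketch below is illustrative: `fetch_not_after` and `days_until_expiry` are names invented for this example, and because the default context verifies the certificate, connecting to a server whose certificate has already expired raises an error instead of returning a date.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the certificate's expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)   # seconds since the epoch
    return (expires_at - now.timestamp()) / 86400


# Offline example using a fixed date instead of a live connection
now = datetime(2018, 1, 4, 9, 34, 43, tzinfo=timezone.utc)
print(days_until_expiry("Jan  5 09:34:43 2018 GMT", now))  # 1.0
```

Run the same check against both the primary and the standby hostnames, and alert well before the remaining days reach zero.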
Frequently Asked Questions
What is the difference between failover and failback?
How long does failover take?
Can failover cause data loss?
What is DNS failover?
How often should I test failover?
What is split-brain in failover?
Does AtomPing help with failover?
Monitor Your Failover Systems
AtomPing monitors from 10 European locations, detecting outages in seconds and alerting your team via email, Slack, Discord, or Telegram. Validate that failover keeps your service available. Free plan includes 50 monitors.