
What is Failover?

Failover is the automatic or manual switching to a backup system when a primary system fails. It is the mechanism that turns redundancy into actual availability, ensuring your service stays online even when individual components break.

Definition

Failover is the process of automatically or manually switching operations from a failed primary component to a standby or backup component. The objective is to maintain service continuity with minimal disruption to users.

For example, if your primary database server crashes, a failover mechanism promotes the replica to primary and redirects all database connections to it. Users may experience a brief pause (seconds to minutes depending on the failover type) but the service continues operating.

How Failover Works

A failover system consists of three core components working together:

1. Health Monitoring (Detection)

A monitoring system continuously checks the health of the primary component. This can be heartbeat signals, health check endpoints, or external monitoring probes. When the primary fails to respond within a configured timeout, it is declared unhealthy. AtomPing performs this function by monitoring your endpoints from multiple regions, detecting failures within seconds.
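As a rough illustration of heartbeat-based detection, the sketch below tracks the time of the last heartbeat and reports the primary as unhealthy once the configured timeout has passed. The class name `HeartbeatMonitor` and the 5-second timeout are hypothetical choices for this example, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Flags a primary as unhealthy when its heartbeat is overdue (illustrative only)."""

    timeout_seconds: float                   # assumed tolerance for silence
    last_heartbeat: Optional[float] = None   # timestamp of the latest heartbeat

    def record_heartbeat(self, now: float) -> None:
        # Called whenever the primary reports in.
        self.last_heartbeat = now

    def is_healthy(self, now: float) -> bool:
        # No heartbeat seen yet counts as unhealthy.
        if self.last_heartbeat is None:
            return False
        # Healthy only if the last heartbeat is within the timeout window.
        return (now - self.last_heartbeat) <= self.timeout_seconds


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat(now=100.0)
print(monitor.is_healthy(now=103.0))  # 3 s since last heartbeat
print(monitor.is_healthy(now=106.5))  # 6.5 s since last heartbeat
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock; in practice they would likely come from a monotonic clock such as `time.monotonic()`.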

2. Decision Logic (Determination)

The failover controller evaluates whether the failure is real (not a transient glitch) and whether to initiate a switch. This often requires multiple consecutive failed checks to prevent false positives. Advanced systems use quorum voting: if 2 of 3 monitoring nodes agree the primary is down, failover proceeds.
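The two safeguards described here, consecutive failed checks and quorum voting, can be combined as in the minimal sketch below. The function names, the threshold of 3 consecutive failures, and the node labels are assumptions made for illustration.

```python
def failed_consecutively(results: list[bool], threshold: int) -> bool:
    """True if the most recent `threshold` checks all failed.

    Each entry in `results` is True for a passed check, False for a failed one.
    """
    if threshold <= 0 or len(results) < threshold:
        return False
    return not any(results[-threshold:])


def should_fail_over(
    node_histories: dict[str, list[bool]],
    threshold: int = 3,   # assumed: consecutive failures needed per node
    quorum: int = 2,      # e.g. 2 of 3 monitoring nodes must agree
) -> bool:
    """True if at least `quorum` nodes independently consider the primary down."""
    votes = [failed_consecutively(h, threshold) for h in node_histories.values()]
    return sum(votes) >= quorum


real_outage = {
    "node-a": [True, False, False, False],
    "node-b": [False, False, False],
    "node-c": [True, True, True, True],   # this node still reaches the primary
}
transient_glitch = {
    "node-a": [False, True, False],
    "node-b": [False, False, True],
    "node-c": [True, True, True],
}
print(should_fail_over(real_outage))
print(should_fail_over(transient_glitch))
```

In the first case two of three nodes have seen three consecutive failures, so the quorum is met. In the second, every node has at least one recent success, so no failover is triggered.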

3. Switchover Execution (Action)

The backup system is activated and traffic is redirected to it. This may involve promoting a database replica, updating DNS records, changing load balancer backends, or activating a standby application server. The switchover must be reliable and fast, as this is when downtime actually occurs.

Types of Failover

Failover strategies vary in speed, cost, and complexity:

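Type | Standby State                           | Failover Time                 | Data Loss Risk              | Cost
-----|-----------------------------------------|-------------------------------|-----------------------------|--------
Hot  | Running, synchronized, serving traffic  | Seconds                       | None (synchronous)          | Highest
Warm | Running, synchronized, not serving      | 30 seconds - 5 minutes        | Minimal (async replication) | Medium
Cold | Powered off, restored from backup       | 10 - 60+ minutes              | Moderate (last backup)      | Lowest
DNS  | Running independently                   | 1 - 5 minutes (TTL dependent) | Depends on setup            | Low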

Automatic vs Manual: Automatic failover happens without human intervention, minimizing MTTR (mean time to recovery). Manual failover requires an operator to trigger the switch, which is slower but reduces the risk of false-positive failovers. Many organizations use automatic failover for well-understood failure modes and manual failover for complex scenarios.

Failover at Different Layers

Different infrastructure layers use different failover mechanisms:

DNS Failover

DNS health checks monitor your primary server. When it fails, DNS records are updated to point to a secondary IP. This approach is simple to implement and works across any infrastructure, but switchover speed is limited by the DNS TTL: resolvers and clients keep using the cached record until it expires. Use AtomPing's DNS lookup tool to verify your DNS configuration.
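To see how TTL bounds DNS failover time, a back-of-the-envelope estimate adds the detection time to the record TTL. The function below is a simplified sketch with hypothetical parameter names; it assumes resolvers honor the TTL and ignores per-check timeouts.

```python
def worst_case_dns_failover_seconds(
    check_interval: float,      # seconds between health checks (assumed)
    failures_required: int,     # consecutive failures before the record is changed
    ttl: float,                 # TTL of the DNS record, in seconds
    update_delay: float = 0.0,  # assumed time for the provider to publish the change
) -> float:
    """Rough upper bound on how long clients may keep resolving to the failed server."""
    detection = check_interval * failures_required
    # A client that cached the old record just before the update keeps it for a full TTL.
    return detection + update_delay + ttl


estimate = worst_case_dns_failover_seconds(check_interval=30, failures_required=3, ttl=60)
print(f"{estimate:.0f} seconds")  # 90 s detection + 60 s TTL
```

With these example values, lowering the TTL from 60 to 30 seconds saves only 30 seconds, because detection time makes up most of the total.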

Application Server Failover

Load balancers continuously health-check backend servers. When a server fails its health check, the load balancer stops routing traffic to it. Remaining healthy servers absorb the load. This is the most common application-layer failover mechanism.
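The behavior described above can be sketched as a round-robin balancer that skips backends marked unhealthy. This is a toy model with hypothetical names (`RoundRobinBalancer`, `app-1`, and so on), not the implementation of any specific load balancer.

```python
class RoundRobinBalancer:
    """Round-robin selection over backends, skipping any marked unhealthy."""

    def __init__(self, backends: list[str]) -> None:
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._index = 0

    def mark(self, backend: str, is_healthy: bool) -> None:
        # Called with the result of each health check.
        if is_healthy:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def next_backend(self) -> str:
        # Try each backend at most once per call, starting from the rotation point.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._index % len(self.backends)]
            self._index += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark("app-2", is_healthy=False)           # app-2 failed its health check
picks = [lb.next_backend() for _ in range(4)]
print(picks)  # ['app-1', 'app-3', 'app-1', 'app-3']
```

When `app-2` later passes its health check, calling `lb.mark("app-2", is_healthy=True)` returns it to the rotation.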

Database Failover

Database failover promotes a read replica to primary when the primary fails. Tools such as pg_auto_failover and Patroni for PostgreSQL, or Group Replication and Orchestrator for MySQL, automate this. Database failover is often the most critical and complex failover scenario because of data consistency requirements.

Region-Level Failover

When an entire region or data center fails, traffic is redirected to another region. This requires geographic redundancy, cross-region data replication, and global load balancing. It provides the strongest protection but is the most complex to implement correctly.

Testing Failover Procedures

Untested failover is unreliable failover. Regular testing is essential to ensure your failover works when you need it:

  • Scheduled failover drills: Regularly shut down primary components during low-traffic windows and verify the standby takes over correctly. Document the procedure and results.
  • Chaos engineering: Randomly introduce failures in production to validate failover under real conditions. Start with non-critical services and expand as confidence grows.
  • Monitor during tests: Use multi-region monitoring to verify that services remain available during failover exercises and measure actual failover duration.
  • Test failback too: After failing over, test returning to the primary. Failback failures are a common source of extended outages.
  • Validate data integrity: After failover, verify that no data was lost or corrupted. Check replication lag and data consistency between systems.
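One way to measure actual failover duration during a drill is to derive it from timestamped probe results. The sketch below assumes a hypothetical list of `(timestamp, ok)` tuples collected by an external monitor; the measurement can be no more precise than the probe interval.

```python
def longest_outage_seconds(probes: list[tuple[float, bool]]) -> float:
    """Longest span from a first failed probe to the next successful probe.

    `probes` must be sorted by timestamp. An outage still in progress at the
    end of the list is not counted, because its end time is unknown.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in probes:
        if not ok and outage_start is None:
            outage_start = timestamp          # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # service recovered
    return longest


drill = [
    (0.0, True), (10.0, True),
    (20.0, False), (30.0, False), (40.0, False),  # primary shut down
    (50.0, True), (60.0, True),                   # standby serving traffic
]
print(longest_outage_seconds(drill))  # 30.0
```

Comparing this figure with the expected failover time for your standby type shows whether the drill met its target.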

Warning: Never assume failover works because it was configured correctly. Configuration drift, software updates, network changes, and expired certificates can silently break failover mechanisms. The only proof is a successful test.

Monitoring Failover Readiness

Continuous monitoring ensures your failover systems are healthy and ready to activate at any moment:

Monitor Standby Health

Your standby systems need their own health checks. A standby database that silently stopped replicating or a warm server with a full disk is not going to save you during failover. Monitor each standby component's health independently.

Track Replication Lag

For database failover, replication lag determines how much data you might lose. Alert when lag exceeds your RPO (Recovery Point Objective). If your RPO is 30 seconds but replication lag is 5 minutes, you are not meeting your recovery objectives.
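The alert condition is a direct comparison of measured lag against the RPO. The snippet below is a minimal sketch using the numbers from the example above; how the lag value is obtained depends on your database and is not shown, and the function name is hypothetical.

```python
def rpo_breached(replication_lag_seconds: float, rpo_seconds: float) -> bool:
    """True if a failover right now could lose more data than the RPO allows."""
    return replication_lag_seconds > rpo_seconds


rpo = 30.0    # Recovery Point Objective: 30 seconds
lag = 300.0   # measured replication lag: 5 minutes

if rpo_breached(lag, rpo):
    print(f"ALERT: replication lag {lag:.0f}s exceeds RPO {rpo:.0f}s")
```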

Verify SSL Certificates on Standby

Standby servers need valid SSL certificates too. An expired certificate on your backup will cause connection failures the moment you fail over. Use AtomPing's TLS expiry monitoring to track certificates on both primary and standby infrastructure.
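For a self-managed check, the remaining validity of a certificate can be computed from its `notAfter` field. The sketch below uses Python's standard `ssl.cert_time_to_seconds` and hard-codes example dates; in a real check the string would likely come from the `notAfter` entry returned by `ssl.SSLSocket.getpeercert()` on a connection to the standby. The function name and the dates are assumptions for illustration.

```python
import ssl


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days remaining before a certificate's notAfter time.

    `not_after` uses the certificate time format, e.g. "Jan  1 00:00:00 2030 GMT".
    A negative result means the certificate has already expired.
    """
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now_epoch) / 86400


# Example values only: pretend "now" is 2 December 2029.
now = ssl.cert_time_to_seconds("Dec  2 00:00:00 2029 GMT")
remaining = days_until_expiry("Jan  1 00:00:00 2030 GMT", now)
print(remaining)  # 30.0
```

Running the same check against both primary and standby hosts, and alerting below a chosen threshold such as 14 days, catches a standby certificate that was missed during renewal.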

Frequently Asked Questions

What is the difference between failover and failback?
Failover is switching from a failed primary system to a backup. Failback is the reverse: returning to the original primary system after it has been repaired. Failback can be automatic or manual. Many organizations prefer manual failback to verify the primary is fully healthy before switching back, avoiding a second outage if the original issue was not completely resolved.
How long does failover take?
It depends on the type. Hot failover (active-active or hot standby) can complete in seconds. Warm failover typically takes 30 seconds to a few minutes as the standby system initializes. Cold failover may take 10 to 60 minutes or more, because the standby must be powered on and restored from backup. DNS-based failover depends on TTL values, often taking 1-5 minutes for clients to resolve to the new address.
Can failover cause data loss?
It depends on your replication strategy. Synchronous replication (every write confirmed on both primary and secondary) prevents data loss but adds latency. Asynchronous replication (writes confirmed on primary first, replicated later) may lose seconds of recent data during failover. The tradeoff between data safety and performance is a core architectural decision.
What is DNS failover?
DNS failover uses health checks to monitor your primary server. When the primary fails, the DNS provider automatically updates records to point to a backup server. DNS failover is simple to implement but limited by DNS TTL (time-to-live): clients cache DNS results, so the switch is not instant. Lower TTLs enable faster failover but increase DNS query volume.
How often should I test failover?
At minimum, test failover quarterly. Critical systems should be tested monthly. Leading organizations practice continuous failover testing (chaos engineering) where failures are randomly introduced in production. Regular testing ensures your failover works, your team knows the procedure, and any configuration drift is caught early.
What is split-brain in failover?
Split-brain occurs when both the primary and secondary systems believe they are the active node and accept writes simultaneously. This causes data divergence and conflicts that are extremely difficult to resolve. Prevention methods include quorum-based consensus, STONITH (Shoot The Other Node In The Head) fencing, and shared storage locks.
Does AtomPing help with failover?
AtomPing monitors your services from 10 European locations and detects outages within seconds. While AtomPing does not perform failover itself, it provides the critical detection layer: instant alerts via email, Slack, Discord, or Telegram give your team or automation systems the trigger to initiate failover. Multi-region monitoring also validates that failover completed successfully.

