What is a Health Check?
A health check is an automated test that verifies whether a system or service is functioning correctly. Health checks are the building blocks of monitoring, load balancing, and container orchestration — they determine when to send traffic, when to restart processes, and when to alert your team.
Definition
A health check is a periodic, automated verification that a system component is alive, responsive, and capable of serving its intended function. Health checks range from simple TCP port probes to comprehensive endpoint tests that validate database connectivity, cache availability, and downstream service dependencies.
For example, a typical health check endpoint at /health might return { "status": "ok", "db": "connected", "cache": "available" } with a 200 status code, confirming that all critical subsystems are operational.
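As a minimal sketch of such an endpoint, the following uses only Python's standard library. The `check_subsystems` helper is a hypothetical stand-in for real database and cache probes, and returning 503 when a subsystem is unhealthy is a common convention assumed here, not something the example above specifies.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_subsystems() -> dict:
    """Hypothetical placeholder; real code would ping the database and cache."""
    return {"db": "connected", "cache": "available"}


def build_health_response() -> tuple[int, dict]:
    """Return (status_code, body) for GET /health."""
    subsystems = check_subsystems()
    healthy = (
        subsystems["db"] == "connected" and subsystems["cache"] == "available"
    )
    body = {"status": "ok" if healthy else "error", **subsystems}
    # Assumption: 503 signals "unhealthy" to whatever is polling the endpoint.
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        code, body = build_health_response()
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Serve /health on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` and requesting `http://127.0.0.1:8080/health` should return the JSON body with a 200 status while both placeholder subsystems report healthy.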
Types of Health Checks
Different types of health checks serve different purposes. Understanding when to use each type is critical for building resilient systems:
Liveness Checks
Answers the question: "Is this process running?" A liveness check verifies that the application process has not crashed or entered a deadlocked state. If a liveness check fails, the process should be restarted.
Example: Kubernetes probes /healthz every 15 seconds. If 3 consecutive probes fail, Kubernetes kills and restarts the container.
Readiness Checks
Answers the question: "Can this service handle requests?" A service might be alive but not ready — still loading data, warming caches, or waiting for database connections. Traffic should not be routed to an unready service.
Startup Checks
Answers the question: "Has this service finished starting up?" Some applications have slow startup sequences (loading ML models, migrating data, building indexes). Startup probes give extra time before liveness/readiness checks begin.
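The interaction between startup and liveness checks can be sketched as a small state tracker. This is an illustrative model, not how any particular orchestrator is implemented: the class name, the return labels, and the default thresholds are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Tracks one service instance. All thresholds are illustrative."""

    startup_failure_budget: int = 30    # failed startup probes tolerated
    liveness_failure_threshold: int = 3  # consecutive failures before restart
    started: bool = False
    startup_failures: int = 0
    liveness_failures: int = 0

    def on_startup_probe(self, ok: bool) -> str:
        """Record one startup probe result."""
        if self.started:
            return "started"
        if ok:
            self.started = True
            return "started"
        self.startup_failures += 1
        if self.startup_failures >= self.startup_failure_budget:
            return "restart"
        return "waiting"

    def on_liveness_probe(self, ok: bool) -> str:
        """Record one liveness probe result; ignored until startup succeeds."""
        if not self.started:
            return "skipped"
        if ok:
            self.liveness_failures = 0
            return "alive"
        self.liveness_failures += 1
        if self.liveness_failures >= self.liveness_failure_threshold:
            return "restart"
        return "failing"
```

The point of the design is that a slow-starting service spends its time in the generous startup budget, so the stricter liveness threshold only applies once the service has reported a successful start.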
Health Check Endpoints in Practice
A well-designed health check endpoint provides actionable information. Here is a recommended structure:
// GET /health: shallow check (liveness)
HTTP 200 OK
{ "status": "ok" }

// GET /health/ready: deep check (readiness)
HTTP 200 OK
{
  "status": "ok",
  "checks": {
    "database": "connected",
    "cache": "available",
    "disk": "sufficient"
  }
}
Best practice: Keep liveness checks fast and dependency-free. If your liveness check calls the database and the database is slow, your process might get restarted unnecessarily. Reserve dependency checks for readiness endpoints.
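The split between a dependency-free liveness check and a dependency-aware readiness check might look like the sketch below. The function names, the generic "ok"/"failed" labels, and the 503 status for an unready service are assumptions chosen for illustration; the checks themselves are passed in as plain callables.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def liveness() -> Tuple[int, dict]:
    """Shallow check: touches no dependency, so a slow database cannot fail it."""
    return 200, {"status": "ok"}


def readiness(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Deep check: run every dependency check and report each result."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            # A check that raises is treated the same as one that returns False.
            results[name] = "failed"
    all_ok = all(value == "ok" for value in results.values())
    body = {"status": "ok" if all_ok else "error", "checks": results}
    # Assumption: 503 tells the load balancer to stop routing traffic here.
    return (200 if all_ok else 503), body
```

For example, `readiness({"database": ping_db, "cache": ping_cache})` would report per-dependency results, where `ping_db` and `ping_cache` are whatever boolean probes the application already has.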
HTTP Health Checks vs TCP vs ICMP
External monitoring systems use different protocols for health checks. Each has trade-offs:
| Protocol | What It Tests | Depth | Best For |
|---|---|---|---|
| HTTP/HTTPS | Full application response (status, body, headers) | Application layer | Web apps, APIs, microservices |
| TCP | Port connectivity (can connect to port) | Transport layer | Databases, mail servers, custom protocols |
| ICMP | Network reachability (host responds to ping) | Network layer | Infrastructure, routers, basic host availability |
Recommendation: Use HTTP checks whenever possible — they validate the application is actually working, not just that the port is open. A service can accept TCP connections but still return errors to every request. AtomPing supports all three protocols, plus DNS and TLS checks for comprehensive coverage.
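The difference in depth between a TCP check and an HTTP check can be illustrated with two small functions built on Python's standard library. This is a sketch of the idea, not how AtomPing or any other monitoring service implements its probes; the function names and the 3-second default timeout are assumptions. ICMP is left out because sending pings typically requires raw-socket privileges.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection can be opened; says nothing about the app."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url: str, timeout: float = 3.0) -> bool:
    """True only if the server answers the request with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, 4xx/5xx responses, bad URLs.
        return False
```

A service whose port is open but which returns errors to every request would pass `tcp_check` and fail `http_check`, which is the gap the recommendation above describes.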
Configuring Health Check Intervals
The check interval determines how quickly you detect outages. Here are guidelines for choosing the right interval:
30 seconds - 1 minute
Use for: Revenue-critical services, payment APIs, login systems
The first failed check is observed within at most 1 minute of an outage starting. Appropriate for services where every minute of downtime has significant impact.
2 - 3 minutes
Use for: Main website, API gateways, internal tools
Good balance between detection speed and resource usage. Suitable for most production services.
5 - 10 minutes
Use for: Documentation sites, staging environments, non-critical services
Conserves monitoring resources while still providing regular availability data.
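The trade-off between interval and detection speed is simple arithmetic, sketched below. The model is a simplification that ignores check timeouts and network latency, and the function names are illustrative. It assumes an alert fires only after a given number of consecutive failed checks.

```python
def detection_delay_seconds(
    interval_s: float, failures_required: int = 1
) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds from outage start to alert.

    Best case: the outage begins just before a scheduled check.
    Worst case: the outage begins just after a successful check.
    """
    if interval_s <= 0 or failures_required < 1:
        raise ValueError("interval must be positive and failures_required >= 1")
    best = interval_s * (failures_required - 1)
    worst = interval_s * failures_required
    return best, worst


def checks_per_day(interval_s: float) -> float:
    """Number of checks a single monitor performs in 24 hours."""
    return 86_400 / interval_s
```

Under this model, a 30-second interval with 3 required failures alerts between 60 and 90 seconds after an outage begins, while a 5-minute interval with the same threshold can take up to 15 minutes.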
Health Checks and Incident Detection
A single failed health check does not necessarily mean your service is down. Modern monitoring systems use multi-step incident detection to prevent false alarms:
Multi-Region Confirmation
When a check fails from one region, the system verifies the failure from other regions. If only one region reports a failure, it is likely a local network issue, not a real outage. AtomPing checks from multiple locations to confirm outages.
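Multi-region confirmation can be reduced to a small decision function. The sketch below is an assumption about how such logic might look, not a description of AtomPing's implementation; the region names and the default of two failing regions are hypothetical.

```python
def confirmed_down(
    region_results: dict[str, bool], min_failing_regions: int = 2
) -> bool:
    """Treat the target as down only if enough regions saw the check fail.

    region_results maps a region name to True (check passed) or False (failed).
    """
    failing = [region for region, ok in region_results.items() if not ok]
    return len(failing) >= min_failing_regions
```

With results such as `{"eu-west": False, "us-east": True, "ap-south": True}`, the function returns `False`, reflecting that a single failing region is more likely a local network issue than a real outage.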
Consecutive Failure Thresholds
An incident is only created after a configurable number of consecutive failures. This prevents transient errors (a single slow response, a brief network hiccup) from triggering alerts. Common thresholds are 2-3 consecutive failures before alerting.
Recovery Verification
Similarly, a service is not marked as recovered until it passes multiple consecutive checks. This prevents premature recovery notifications when a service is flapping (alternating between up and down states).
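Failure thresholds and recovery verification are two halves of the same idea: the incident state changes only after several consecutive results contradict it. The class below is a minimal sketch under that assumption; its name, defaults, and event labels are illustrative.

```python
from typing import Optional


class IncidentTracker:
    """Opens and resolves an incident based on consecutive check results."""

    def __init__(self, failures_to_open: int = 3, successes_to_close: int = 2):
        self.failures_to_open = failures_to_open
        self.successes_to_close = successes_to_close
        self.incident_open = False
        self._streak = 0  # consecutive results contradicting the current state

    def record(self, ok: bool) -> Optional[str]:
        """Feed one check result; return "opened", "resolved", or None."""
        contradicts = ok if self.incident_open else not ok
        if not contradicts:
            # The result agrees with the current state, so any streak is broken.
            self._streak = 0
            return None
        self._streak += 1
        needed = (
            self.successes_to_close if self.incident_open else self.failures_to_open
        )
        if self._streak < needed:
            return None
        self.incident_open = not self.incident_open
        self._streak = 0
        return "opened" if self.incident_open else "resolved"
```

Because any agreeing result resets the streak, a flapping service that alternates between up and down never accumulates enough consecutive results to open or resolve an incident.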
Configure Health Checks with AtomPing
Monitor your health check endpoints from multiple regions with HTTP, TCP, ICMP, DNS, and TLS checks. Get instant alerts when checks fail. Free forever plan includes 50 monitors with email alerts.