What is a Health Check?
A health check is an automated test that verifies whether a system or service is functioning correctly. Health checks are the building blocks of monitoring, load balancing, and container orchestration — they determine when to send traffic, when to restart processes, and when to alert your team.
Definition
A health check is a periodic, automated verification that a system component is alive, responsive, and capable of serving its intended function. Health checks range from simple TCP port probes to comprehensive endpoint tests that validate database connectivity, cache availability, and downstream service dependencies.
For example, a typical health check endpoint at /health might return { "status": "ok", "db": "connected", "cache": "available" } with a 200 status code, confirming that all critical subsystems are operational.
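On the monitoring side, evaluating such a response can be sketched in a few lines of Python. The `is_healthy` helper and its exact criteria are illustrative, not a prescribed API:

```python
import json

def is_healthy(status_code: int, body: str) -> bool:
    """Evaluate a /health response: require HTTP 200 and "status": "ok".

    Hypothetical helper -- real monitors may also inspect individual
    subsystem fields or response latency.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return payload.get("status") == "ok"

# The example payload above evaluates as healthy:
print(is_healthy(200, '{"status": "ok", "db": "connected", "cache": "available"}'))  # True
```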
Types of Health Checks
Different types of health checks serve different purposes. Understanding when to use each type is critical for building resilient systems:
Liveness Checks
Answers the question: "Is this process running?" A liveness check verifies that the application process has not crashed or entered a deadlocked state. If a liveness check fails, the process should be restarted.
Example: Kubernetes probes /healthz every 15 seconds. If 3 consecutive probes fail, Kubernetes kills and restarts the pod.

Readiness Checks
Answers the question: "Can this service handle requests?" A service might be alive but not ready — still loading data, warming caches, or waiting for database connections. Traffic should not be routed to an unready service.
Startup Checks
Answers the question: "Has this service finished starting up?" Some applications have slow startup sequences (loading ML models, migrating data, building indexes). Startup probes give extra time before liveness/readiness checks begin.
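In Kubernetes, the three probe types map directly onto the pod spec. A sketch with illustrative paths and timings (example values, not recommended defaults):

```yaml
# Illustrative probe configuration for a container spec.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 15
  failureThreshold: 3      # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30     # allow up to 30 * 10s = 5 minutes to start
```

While the startup probe is failing (but under its threshold), liveness and readiness probes are suspended, which is what gives slow-starting services their grace period.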
Health Check Endpoints in Practice
A well-designed health check endpoint provides actionable information. Here is a recommended structure:
// GET /health — Shallow check (liveness)
HTTP 200 OK
{ "status": "ok" }
// GET /health/ready — Deep check (readiness)
HTTP 200 OK
{
  "status": "ok",
  "checks": {
    "database": "connected",
    "cache": "available",
    "disk": "sufficient"
  }
}
Best practice: Keep liveness checks fast and dependency-free. If your liveness check calls the database and the database is slow, your process might get restarted unnecessarily. Reserve dependency checks for readiness endpoints.
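The shallow/deep split above can be sketched with Python's standard library. The `check_database` and `check_cache` stubs stand in for real dependency probes and are assumptions, not a prescribed design:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical dependency probes -- a real service would open a database
# connection, ping the cache, and so on.
def check_database() -> bool:
    return True

def check_cache() -> bool:
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Shallow liveness check: no dependencies, always fast.
            self._respond(200, {"status": "ok"})
        elif self.path == "/health/ready":
            # Deep readiness check: verify critical dependencies.
            checks = {
                "database": "connected" if check_database() else "down",
                "cache": "available" if check_cache() else "down",
            }
            healthy = all(v != "down" for v in checks.values())
            self._respond(200 if healthy else 503,
                          {"status": "ok" if healthy else "degraded",
                           "checks": checks})
        else:
            self._respond(404, {"error": "not found"})

    def _respond(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# To run standalone:
#   HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

Note that the liveness handler returns immediately with no I/O, while only the readiness handler touches dependencies, matching the best practice above.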
HTTP Health Checks vs TCP vs ICMP
External monitoring systems use different protocols for health checks. Each has trade-offs:
| Protocol | What It Tests | Depth | Best For |
|---|---|---|---|
| HTTP/HTTPS | Full application response (status, body, headers) | Application layer | Web apps, APIs, microservices |
| TCP | Port connectivity (can connect to port) | Transport layer | Databases, mail servers, custom protocols |
| ICMP | Network reachability (host responds to ping) | Network layer | Infrastructure, routers, basic host availability |
Recommendation: Use HTTP checks whenever possible — they validate the application is actually working, not just that the port is open. A service can accept TCP connections but still return errors to every request. AtomPing supports all three protocols, plus DNS and TLS checks for comprehensive coverage.
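A TCP check from the table above reduces to attempting a connection. A minimal Python sketch (the function name and timeout are illustrative):

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Transport-layer check: succeeds if a TCP connection can be opened.

    Note the limitation discussed above: this proves the port is open,
    not that the application behind it returns correct responses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```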
Configuring Health Check Intervals
The check interval determines how quickly you detect outages. Here are guidelines for choosing the right interval:
30 seconds - 1 minute
Use for: Revenue-critical services, payment APIs, login systems
Maximum detection delay of 1 minute. Appropriate for services where every minute of downtime has significant impact.
2 - 3 minutes
Use for: Main website, API gateways, internal tools
Good balance between detection speed and resource usage. Suitable for most production services.
5 - 10 minutes
Use for: Documentation sites, staging environments, non-critical services
Conserves monitoring resources while still providing regular availability data.
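When an interval is combined with a consecutive-failure threshold, the worst-case detection delay is roughly their product. A sketch of the arithmetic, assuming evenly spaced checks:

```python
def worst_case_detection_seconds(interval_s: int, failure_threshold: int) -> int:
    """Upper bound on time to detect an outage.

    Worst case: the outage begins just after a successful check, then
    `failure_threshold` consecutive checks must fail before alerting.
    """
    return interval_s * failure_threshold

# A 1-minute interval with a 3-failure threshold alerts within 3 minutes:
print(worst_case_detection_seconds(60, 3))  # 180
```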
Health Checks and Incident Detection
A single failed health check does not necessarily mean your service is down. Modern monitoring systems use multi-step incident detection to prevent false alarms:
Multi-Region Confirmation
When a check fails from one region, the system verifies the failure from other regions. If only one region reports a failure, it is likely a local network issue, not a real outage. AtomPing checks from multiple locations to confirm outages.
Consecutive Failure Thresholds
An incident is only created after a configurable number of consecutive failures. This prevents transient errors (a single slow response, a brief network hiccup) from triggering alerts. Common thresholds are 2-3 consecutive failures before alerting.
Recovery Verification
Similarly, a service is not marked as recovered until it passes multiple consecutive checks. This prevents premature recovery notifications when a service is flapping (alternating between up and down states).
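The two mechanisms, failure thresholds and recovery verification, can be sketched together as a small state machine. The thresholds here are illustrative, not AtomPing's actual values:

```python
class IncidentDetector:
    """Minimal sketch of threshold-based incident detection.

    Opens an incident only after `fail_threshold` consecutive failed
    checks, and closes it only after `recover_threshold` consecutive
    successful checks, so single blips and flapping don't cause noise.
    """

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.consecutive_fails = 0
        self.consecutive_passes = 0
        self.incident_open = False

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return whether an incident is open."""
        if check_passed:
            self.consecutive_passes += 1
            self.consecutive_fails = 0
            if self.incident_open and self.consecutive_passes >= self.recover_threshold:
                self.incident_open = False
        else:
            self.consecutive_fails += 1
            self.consecutive_passes = 0
            if self.consecutive_fails >= self.fail_threshold:
                self.incident_open = True
        return self.incident_open
```

A single failure never opens an incident, and a single success during flapping never closes one; both transitions require a sustained streak.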
Frequently Asked Questions
What is a health check in software systems?
A health check is an automated test that verifies whether a service, application, or infrastructure component is running correctly. It typically involves sending a request to a dedicated endpoint (like /health or /status) and evaluating the response to determine if the system is healthy, degraded, or down.
What's the difference between a liveness check and a readiness check?
A liveness check determines whether a process is running — if it fails, the process should be restarted. A readiness check determines whether the service can handle traffic — if it fails, traffic should be routed elsewhere but the process stays running. For example, an app might be alive but not ready if it's still loading configuration or warming caches.
How often should health checks run?
It depends on the context. External uptime monitoring typically checks every 30 seconds to 5 minutes. Container orchestrators like Kubernetes run liveness probes every 10-30 seconds. Load balancers check backend health every 5-30 seconds. The right interval balances detection speed against resource overhead.
What should a health check endpoint return?
A well-designed health check endpoint returns an HTTP 200 status when healthy and a 5xx status when unhealthy. It should also include details about subsystem health (database connectivity, cache availability, disk space) in the response body as JSON, so operators can quickly identify which component is causing issues.
Should health checks verify dependencies?
It depends on the check type. A shallow health check (liveness) should only verify the process is running — no dependency checks. A deep health check (readiness) should verify critical dependencies like databases, caches, and external APIs. Deep checks are more informative but can cascade failures if a dependency is slow.
How do health checks relate to incident detection?
Health checks are the foundation of incident detection. When checks fail consistently from multiple locations over multiple cycles, the monitoring system creates an incident. This approach prevents false alarms from transient issues while still detecting real outages quickly.
Can I health-check a service that doesn't have a /health endpoint?
Yes. You can use TCP checks to verify a port is open, ICMP ping to verify the host is reachable, or HTTP checks against any URL that returns a predictable response. However, dedicated health endpoints are preferred because they can report on internal service state, not just network reachability.