What is a Health Check?

A health check is an automated test that verifies whether a system or service is functioning correctly. Health checks are the building blocks of monitoring, load balancing, and container orchestration — they determine when to send traffic, when to restart processes, and when to alert your team.

Definition

A health check is a periodic, automated verification that a system component is alive, responsive, and capable of serving its intended function. Health checks range from simple TCP port probes to comprehensive endpoint tests that validate database connectivity, cache availability, and downstream service dependencies.

For example, a typical health check endpoint at /health might return { "status": "ok", "db": "connected", "cache": "available" } with a 200 status code, confirming that all critical subsystems are operational.

Types of Health Checks

Different types of health checks serve different purposes. Understanding when to use each type is critical for building resilient systems:

Liveness Checks

Answers the question: "Is this process running?" A liveness check verifies that the application process has not crashed or entered a deadlocked state. If a liveness check fails, the process should be restarted.

Example: A Kubernetes liveness probe sends an HTTP GET to /healthz every 15 seconds. If 3 consecutive probes fail, the kubelet kills the container and restarts it according to the pod's restart policy.

Readiness Checks

Answers the question: "Can this service handle requests?" A service might be alive but not ready — still loading data, warming caches, or waiting for database connections. Traffic should not be routed to an unready service.

Example: A readiness probe checks that the database connection pool is initialized and the in-memory cache is populated. Until ready, the load balancer routes traffic to other instances.

Startup Checks

Answers the question: "Has this service finished starting up?" Some applications have slow startup sequences (loading ML models, migrating data, building indexes). Startup probes give extra time before liveness/readiness checks begin.

Example: A Java application takes 90 seconds to start. A startup probe that allows up to 120 seconds to succeed (for example, 12 attempts at 10-second intervals) prevents Kubernetes from killing the container during normal initialization.

Health Check Endpoints in Practice

A well-designed health check endpoint provides actionable information. Here is a recommended structure:

// GET /health: shallow check (liveness)
HTTP 200 OK
{ "status": "ok" }

// GET /health/ready: deep check (readiness)
HTTP 200 OK
{
  "status": "ok",
  "checks": {
    "database": "connected",
    "cache": "available",
    "disk": "sufficient"
  }
}

Best practice: Keep liveness checks fast and dependency-free. If your liveness check calls the database and the database is slow, your process might get restarted unnecessarily. Reserve dependency checks for readiness endpoints.
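
The split between a shallow and a deep endpoint can be sketched with Python's standard library. This is a minimal illustration rather than a production server. The paths mirror the structure above, while check_database, check_cache, serve, and the 503 status for a failed readiness check are assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    """Hypothetical dependency check; replace with a real connectivity test."""
    return True


def check_cache() -> bool:
    """Hypothetical dependency check; replace with a real cache ping."""
    return True


def liveness():
    # Shallow check: touches no dependencies, so a slow database
    # cannot cause the process to be restarted.
    return 200, {"status": "ok"}


def readiness(checks=None):
    # Deep check: run every dependency check and report each result.
    checks = checks or {"database": check_database, "cache": check_cache}
    results = {name: bool(fn()) for name, fn in checks.items()}
    healthy = all(results.values())
    body = {
        "status": "ok" if healthy else "unavailable",
        "checks": {name: "ok" if ok else "failed" for name, ok in results.items()},
    }
    # Assumed convention: 503 signals that traffic should go elsewhere.
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status, body = liveness()
        elif self.path == "/health/ready":
            status, body = readiness()
        else:
            status, body = 404, {"status": "not found"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the example server until interrupted."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Calling serve() starts the server locally. Keeping liveness() and readiness() as plain functions makes the health logic testable without opening a socket.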

HTTP Health Checks vs TCP vs ICMP

External monitoring systems use different protocols for health checks. Each has trade-offs:

Protocol	What It Tests	Depth	Best For
HTTP/HTTPS	Full application response (status, body, headers)	Application layer	Web apps, APIs, microservices
TCP	Port connectivity (can connect to port)	Transport layer	Databases, mail servers, custom protocols
ICMP	Network reachability (host responds to ping)	Network layer	Infrastructure, routers, basic host availability

Recommendation: Use HTTP checks whenever possible — they validate the application is actually working, not just that the port is open. A service can accept TCP connections but still return errors to every request. AtomPing supports all three protocols, plus DNS and TLS checks for comprehensive coverage.
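
The difference between a TCP check and an HTTP check can be illustrated with a short standard-library sketch. The function names and default timeouts are assumptions for this example. ICMP is omitted because sending ping packets from Python usually requires raw sockets and elevated privileges.

```python
import socket
import urllib.error
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Transport-layer check: passes if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def http_check(url: str, timeout: float = 3.0, expected_status: int = 200) -> bool:
    """Application-layer check: passes only if the response status matches."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == expected_status
    except urllib.error.HTTPError as exc:
        # urllib raises HTTPError for 4xx/5xx: the port is open,
        # but the application is reporting a problem.
        return exc.code == expected_status
    except (urllib.error.URLError, OSError):
        return False
```

A service that accepts connections but answers every request with a 500 passes tcp_check and fails http_check, which is the case the recommendation above is guarding against.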

Configuring Health Check Intervals

The check interval determines how quickly you detect outages. Here are guidelines for choosing the right interval:

30 seconds - 1 minute

Use for: Revenue-critical services, payment APIs, login systems

The first failed check arrives at most 1 minute after an outage begins. Appropriate for services where every minute of downtime has significant impact.

2 - 3 minutes

Use for: Main website, API gateways, internal tools

Good balance between detection speed and resource usage. Suitable for most production services.

5 - 10 minutes

Use for: Documentation sites, staging environments, non-critical services

Conserves monitoring resources while still providing regular availability data.
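
To compare these tiers, it can help to estimate how long an outage may go unnoticed. The sketch below is a rough model with stated assumptions: checks fire at a fixed interval, an alert is raised on the Nth consecutive failure, and probe timeouts and notification delivery time are ignored. The function name is hypothetical.

```python
def detection_delay_seconds(interval_s: float, failure_threshold: int = 1):
    """Return (best_case, worst_case) seconds from outage start to the alerting check.

    Best case: the outage begins just before a check, so the first failure
    is immediate and only (threshold - 1) further intervals are needed.
    Worst case: the outage begins just after a check, so a full interval
    passes before the first failure is observed.
    """
    if interval_s <= 0 or failure_threshold < 1:
        raise ValueError("interval must be positive and threshold at least 1")
    best = (failure_threshold - 1) * interval_s
    worst = failure_threshold * interval_s
    return best, worst


# A 60-second interval with a 3-failure threshold: between 2 and 3 minutes.
best, worst = detection_delay_seconds(60, failure_threshold=3)
```

Under this model, halving the interval halves the worst-case delay, while raising the failure threshold increases it in proportion.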

Health Checks and Incident Detection

A single failed health check does not necessarily mean your service is down. Modern monitoring systems use multi-step incident detection to prevent false alarms:

Multi-Region Confirmation

When a check fails from one region, the system verifies the failure from other regions. If only one region reports a failure, it is likely a local network issue, not a real outage. AtomPing checks from multiple locations to confirm outages.
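
A minimal sketch of the confirmation rule follows, assuming each region reports a simple pass/fail result. The region names, the function name, and the default of two failing regions are illustrative assumptions, not a description of any particular product's logic.

```python
def confirm_outage(results_by_region: dict, min_failing_regions: int = 2) -> bool:
    """Return True when enough regions report a failed check.

    results_by_region maps a region name to True (check passed)
    or False (check failed).
    """
    failing = [region for region, passed in results_by_region.items() if not passed]
    return len(failing) >= min_failing_regions


# One failing region is treated as a local network issue.
local_issue = confirm_outage({"us-east": False, "eu-west": True, "ap-south": True})

# Two failing regions confirm the outage.
real_outage = confirm_outage({"us-east": False, "eu-west": False, "ap-south": True})
```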

Consecutive Failure Thresholds

An incident is only created after a configurable number of consecutive failures. This prevents transient errors (a single slow response, a brief network hiccup) from triggering alerts. Common thresholds are 2-3 consecutive failures before alerting.

Recovery Verification

Similarly, a service is not marked as recovered until it passes multiple consecutive checks. This prevents premature recovery notifications when a service is flapping (alternating between up and down states).
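
Failure thresholds and recovery verification can be combined into one small state machine. The sketch below assumes a threshold of 3 failures to open an incident and 2 successes to resolve it. The class name, event strings, and defaults are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentTracker:
    failure_threshold: int = 3   # consecutive failures before opening an incident
    recovery_threshold: int = 2  # consecutive successes before resolving it
    incident_open: bool = False
    _streak: int = 0             # run of results that contradict the current state

    def record(self, check_passed: bool) -> Optional[str]:
        """Feed one check result; return an event name when the state changes."""
        # While healthy, failures contradict the state; during an
        # incident, successes do.
        contradicts = check_passed if self.incident_open else not check_passed
        if not contradicts:
            self._streak = 0  # a flapping result resets the count
            return None

        self._streak += 1
        threshold = self.recovery_threshold if self.incident_open else self.failure_threshold
        if self._streak < threshold:
            return None

        self.incident_open = not self.incident_open
        self._streak = 0
        return "incident_opened" if self.incident_open else "incident_resolved"
```

Because any result that agrees with the current state resets the streak, a flapping service neither opens nor resolves an incident until it settles for the required number of checks.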

Frequently Asked Questions

What is a health check in software systems?

A health check is an automated test that verifies whether a service, application, or infrastructure component is running correctly. It typically involves sending a request to a dedicated endpoint (like /health or /status) and evaluating the response to determine if the system is healthy, degraded, or down.

What's the difference between a liveness check and a readiness check?

A liveness check determines whether a process is running — if it fails, the process should be restarted. A readiness check determines whether the service can handle traffic — if it fails, traffic should be routed elsewhere but the process stays running. For example, an app might be alive but not ready if it's still loading configuration or warming caches.

How often should health checks run?

It depends on the context. External uptime monitoring typically checks every 30 seconds to 5 minutes. Container orchestrators like Kubernetes run liveness probes every 10-30 seconds. Load balancers check backend health every 5-30 seconds. The right interval balances detection speed against resource overhead.

What should a health check endpoint return?

A well-designed health check endpoint returns an HTTP 200 status when healthy and a 5xx status (typically 503 Service Unavailable) when unhealthy. It should also include details about subsystem health (database connectivity, cache availability, disk space) in the response body as JSON, so operators can quickly identify which component is causing issues.

Should health checks verify dependencies?

It depends on the check type. A shallow health check (liveness) should only verify that the process is running, with no dependency checks. A deep health check (readiness) should verify critical dependencies like databases, caches, and external APIs. Deep checks are more informative but can cause cascading failures: if a shared dependency is slow, every instance may fail its readiness check at the same time and be taken out of rotation.

How do health checks relate to incident detection?

Health checks are the foundation of incident detection. When checks fail consistently from multiple locations over multiple cycles, the monitoring system creates an incident. This approach prevents false alarms from transient issues while still detecting real outages quickly.

Can I health-check a service that doesn't have a /health endpoint?

Yes. You can use TCP checks to verify a port is open, ICMP ping to verify the host is reachable, or HTTP checks against any URL that returns a predictable response. However, dedicated health endpoints are preferred because they can report on internal service state, not just network reachability.

