A health check endpoint is the monitoring entry point. If it returns 200, your service is considered alive. If 503 — dead. The design of this endpoint determines whether you catch a real outage in 30 seconds or miss a degradation that users notice first.
Most teams implement a minimal GET /health → 200 OK. Better than nothing, but only catches complete process death. Real health checks verify dependencies, distinguish liveness from readiness, and provide diagnostic information for fast troubleshooting.
Two Levels of Health Checks
Liveness: "Is the process alive?"
The liveness probe answers one question: is the application running and responding to requests? It doesn't check dependencies — only the process itself. If a liveness probe fails, Kubernetes restarts the container.
Endpoint: GET /health/live
What it checks: Process is running, HTTP server responds
Success response: 200 {"status": "ok"}
Fails when: Deadlock, out of memory, infinite loop, event loop blocked
Liveness should be as lightweight as possible. No database calls, cache hits, or external API requests. If your liveness probe depends on the database, its failure causes cascading container restarts — making things worse.
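As a sketch, a liveness handler can be little more than a constant response. The plain function below stands in for whatever framework wiring routes GET /health/live (the route name and the status/body tuple shape are assumptions for illustration):

```python
import json

def health_live():
    """Liveness handler: HTTP status and body for GET /health/live.

    Deliberately touches no dependencies: if this code runs at all,
    the process is alive and the server loop is handling requests.
    """
    return 200, json.dumps({"status": "ok"})
```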
Readiness: "Ready to handle traffic?"
The readiness probe answers: can this instance process a user request? It checks dependencies: database, cache, queues, external APIs. If readiness fails, the instance is removed from load balancing but not restarted.
Endpoint: GET /health/ready
What it checks: Database is accessible, Redis/cache responds, critical external APIs are reachable
Success response:
{
  "status": "healthy",
  "timestamp": "2026-03-26T10:30:00Z",
  "checks": {
    "database": {"status": "healthy", "latency_ms": 3},
    "redis": {"status": "healthy", "latency_ms": 1},
    "stripe_api": {"status": "healthy", "latency_ms": 45}
  }
}

Degradation response:
{
  "status": "degraded",
  "timestamp": "2026-03-26T10:30:00Z",
  "checks": {
    "database": {"status": "healthy", "latency_ms": 3},
    "redis": {"status": "unhealthy", "error": "connection timeout"},
    "stripe_api": {"status": "healthy", "latency_ms": 45}
  }
}

Designing Dependency Checks
Each dependency in a readiness probe should be checked independently with its own timeout. One slow check shouldn't block the entire endpoint.
Database check
# Python / Django
import time
from django.db import connection  # Django's default database connection

def check_database():
    try:
        start = time.monotonic()
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
        latency = (time.monotonic() - start) * 1000
        return {"status": "healthy", "latency_ms": round(latency)}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

Timeout: 3 seconds. SELECT 1 is the minimal query that exercises the connection pool, the network path to the database, and basic PostgreSQL/MySQL functionality.
Cache check (Redis/Memcached)
def check_redis():
    try:
        start = time.monotonic()
        redis_client.ping()  # redis_client: a configured redis.Redis instance
        latency = (time.monotonic() - start) * 1000
        return {"status": "healthy", "latency_ms": round(latency)}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

Timeout: 2 seconds. PING verifies the connection and authentication. If the cache is not critical (you can fall back to the database), its failure can be reported as degraded rather than unhealthy.
External API check
import requests

def check_stripe():
    try:
        start = time.monotonic()
        response = requests.get(
            "https://api.stripe.com/healthcheck",
            timeout=5
        )
        latency = (time.monotonic() - start) * 1000
        if response.status_code == 200:
            return {"status": "healthy", "latency_ms": round(latency)}
        return {"status": "degraded", "http_status": response.status_code}
    except requests.Timeout:
        return {"status": "unhealthy", "error": "timeout"}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

Caution: external API checks increase health endpoint latency and add a third-party dependency. Include only critical dependencies (payments, auth provider). Check non-critical ones asynchronously and cache the results for 30-60 seconds.
Classifying Dependencies: Critical vs Degraded
Not all dependencies matter equally. Database is critical: nothing works without it. Email service is degraded: app works, but emails don't send. Correct classification prevents false alarms from non-critical failures.
Critical (→ unhealthy, HTTP 503): Primary database, authentication service, core business logic dependencies
Degraded (→ degraded, HTTP 200): Cache (Redis/Memcached), email service, analytics, non-essential third-party APIs
Unchecked: CDN, logging service, metrics collection — their failure doesn't affect request processing ability
CRITICAL_DEPS = {"database"}  # example set; add auth and other must-have services

def get_overall_status(checks):
    if any(c["status"] == "unhealthy" for name, c in checks.items()
           if name in CRITICAL_DEPS):
        return "unhealthy", 503
    if any(c["status"] != "healthy" for c in checks.values()):
        return "degraded", 200
    return "healthy", 200

Response Format
Standardized response format simplifies integration with monitoring systems and enables JSON path assertions to verify specific fields.
{
  "status": "healthy",
  "version": "2.4.1",
  "uptime_seconds": 86420,
  "timestamp": "2026-03-26T10:30:00Z",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 3,
      "type": "postgresql"
    },
    "redis": {
      "status": "healthy",
      "latency_ms": 1,
      "type": "redis"
    },
    "queue": {
      "status": "healthy",
      "latency_ms": 2,
      "pending_jobs": 142,
      "type": "celery"
    }
  }
}
In AtomPing you can configure an HTTP check with JSON path assertion $.status = healthy. This verifies not only that the endpoint responds 200, but that all dependencies are healthy. If the database fails, status changes to unhealthy, assertion fails, and monitoring creates an incident.
Kubernetes Probes: Configuration
In Kubernetes, liveness and readiness probes are configured in the pod manifest. Correct configuration balances detection speed with resilience to brief failures.
spec:
  containers:
  - name: api
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8000
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8000
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 2
    startupProbe:
      httpGet:
        path: /health/live
        port: 8000
      initialDelaySeconds: 0
      periodSeconds: 5
      failureThreshold: 30

startupProbe is a third probe type that gives the application time to initialize (up to 150 seconds in the example above: 5s period × 30 allowed failures). Until the startup probe passes, liveness and readiness probes don't run. Useful for applications with heavy startup (migrations, cache warming, ML model loading).
Common Mistakes
1. Liveness probe with dependency checks
Most common mistake: checking the database in liveness probe. Database fails → liveness fails → Kubernetes restarts all pods → pods start simultaneously → thundering herd on database → database never recovers. Liveness = process only. Dependencies = readiness only.
2. Missing timeout on individual checks
If one dependency check hangs for 30 seconds (e.g., DNS resolution timeout to external API), the entire health endpoint hangs. Kubernetes interprets this as failure. Solution: run checks in parallel with individual timeouts (2-5 seconds each).
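The solution can be sketched with concurrent.futures, which lets each check get its own deadline without blocking the others. The check functions and the per-check timeout values are the assumptions from the sections above:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Per-check deadlines in seconds, matching the timeouts suggested earlier.
CHECK_TIMEOUTS = {"database": 3, "redis": 2, "stripe_api": 5}

def run_checks(checks, timeouts=CHECK_TIMEOUTS, default_timeout=3):
    """Run all check functions in parallel, enforcing a deadline per check."""
    pool = ThreadPoolExecutor(max_workers=max(len(checks), 1))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    results = {}
    for name, future in futures.items():
        try:
            results[name] = future.result(timeout=timeouts.get(name, default_timeout))
        except FutureTimeout:
            # The hung check thread keeps running; we just stop waiting for it.
            results[name] = {"status": "unhealthy", "error": "check timed out"}
    pool.shutdown(wait=False)  # don't block the endpoint on hung checks
    return results
```

Note that future.result(timeout=...) stops waiting but doesn't cancel the running check, which is why shutdown(wait=False) matters: the endpoint returns on time even if one thread is still stuck.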
3. Heavy health checks
Health endpoint is called every 10-30 seconds by dozens of consumers (Kubernetes, load balancer, external monitoring). If checks execute complex SQL queries or call 5 external APIs, they create noticeable load. Rule: health check endpoint should respond in 50-200ms, period.
4. Exposing secrets in health response
Never include connection strings, API keys, internal IP addresses, or table names in health check responses. Even if the endpoint is "internal only" — leaking one URL to logs exposes your infrastructure.
5. Single /health without liveness/readiness split
One GET /health endpoint forces a choice: check dependencies (risking cascading restarts) or skip them (missing degradation). Separating into /health/live and /health/ready solves this dilemma.
Advanced Patterns
Cached readiness
Instead of checking dependencies on every request, run a background task checking them every 10-15 seconds and caching results in memory. Health endpoint returns cached results instantly. Reduces load and prevents timeouts during high request frequency.
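One possible sketch: a daemon thread refreshes the results on a fixed interval while the endpoint serves the latest snapshot. The check functions and the 10-second interval are the assumptions from the paragraph above:

```python
import threading
import time

_cache = {"checks": {}, "updated_at": None}
_lock = threading.Lock()

def _refresh_loop(checks, interval):
    while True:
        results = {name: fn() for name, fn in checks.items()}
        with _lock:
            _cache["checks"] = results
            _cache["updated_at"] = time.time()
        time.sleep(interval)

def start_background_checks(checks, interval=10):
    """Start the refresh loop; daemon=True so it never blocks process shutdown."""
    threading.Thread(
        target=_refresh_loop, args=(checks, interval), daemon=True
    ).start()

def cached_readiness():
    """Health endpoint body: returns the latest snapshot, no checks run inline."""
    with _lock:
        return dict(_cache)
```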
Graceful degradation signaling
Instead of binary healthy/unhealthy, use three states: healthy (everything works), degraded (non-critical dependencies unavailable but service works), unhealthy (critical dependency unavailable). Monitoring can react differently to each: degraded = warning, unhealthy = critical alert.
Deep health vs shallow health
Two endpoints: /health (shallow — fast process check for load balancer, 1-5ms) and /health/deep (full dependency check for monitoring, 50-200ms). Load balancer uses shallow, external monitoring uses deep. This separates consumer needs.
Integration with External Monitoring
Health check endpoint is half the solution. The other half is external monitoring regularly polling this endpoint from different regions and alerting on problems.
Setup in AtomPing:
1. Create HTTP monitor with URL https://api.yourapp.com/health/ready
2. Add JSON path assertion: $.status equals healthy
3. Set response time threshold: 5000ms (health endpoint shouldn't be slow)
4. Interval: 30 seconds
5. Enable quorum confirmation to prevent false alarms
External monitoring checks what Kubernetes probes can't: internet reachability (DNS, routing, TLS), performance from user perspective, and full chain health (CDN → load balancer → app → database).
Checklist: Designing Health Check Endpoints
Architecture: Separate /health/live (liveness) and /health/ready (readiness) endpoints
Dependency checks: Each dependency checked with individual timeout (2-5s)
Classification: Dependencies separated into critical and non-critical
Response format: JSON with overall status, per-dependency status, latency, timestamp
Performance: Health endpoint responds in 50-200ms under normal conditions
Security: No secrets in response, liveness doesn't require auth
Monitoring: Endpoint polled by external monitoring with JSON path assertions
Related Articles
API Monitoring: Complete Guide — How to monitor REST API endpoints
Monitoring Microservices — Health checks in distributed systems
Internal vs External Monitoring — Why you need both approaches
How to Reduce False Alarms — Quorum confirmation and batch anomaly detection
FAQ
What is a health check endpoint?
A health check endpoint is a dedicated API route (typically /health or /healthz) that returns the current operational status of your application. It verifies that the app is running, its dependencies (database, cache, external APIs) are reachable, and critical subsystems function correctly. Monitoring tools poll this endpoint to detect outages.
Should I use /health or /healthz?
/health is more readable and widely understood. /healthz originated in Kubernetes (from Google's convention of appending 'z' to internal endpoints). Both work — pick one and be consistent. Kubernetes specifically supports both. If you're building a public API, /health is the more standard choice.
What should a health check endpoint return?
At minimum: HTTP 200 with a JSON body containing overall status and individual dependency checks (database, cache, queue). Include response time for each dependency. Return HTTP 503 when any critical dependency is unhealthy. Always include a timestamp. Optionally: version number, uptime duration, and region identifier.
Should health checks be authenticated?
The basic liveness endpoint (/health/live) should not require authentication — monitoring tools and load balancers need unauthenticated access. The detailed readiness endpoint (/health/ready) can optionally require authentication if it exposes internal architecture details. Never expose sensitive data (connection strings, credentials) in health check responses.
How often should monitoring tools poll health endpoints?
Every 30 seconds for production services with SLA commitments. Every 1-3 minutes for internal tools and staging environments. Every 5 minutes for non-critical services. The endpoint itself should respond within 5 seconds — if dependency checks take longer, implement timeouts and return partial status.
What's the difference between liveness and readiness probes?
A liveness probe checks 'is the process alive?' — if it fails, the container should be restarted. A readiness probe checks 'can this instance handle traffic?' — if it fails, the instance is removed from the load balancer but not restarted. Your app can be alive (liveness pass) but not ready (readiness fail) during startup or when a dependency is down.