
Monitoring Microservices: A Practical Guide

How to monitor microservices effectively. Covers health check patterns, service mesh observability, cascading failure detection, distributed tracing, and external monitoring integration.

2026-03-26 · 15 min · Technical Guide

Microservices provide flexibility and scalability. But they also transform monitoring from a simple task ("one server, one health check") into a multidimensional challenge: 20 services, 50 inter-service connections, 5 databases, 3 queues. Something broke — but what exactly, and what's the impact on users?

In this guide, we explore how to build monitoring for a microservices architecture: from health check endpoints for each service to end-to-end synthetic checks, from distributed tracing to status page mapping.

Four Layers of Monitoring

Complete microservices monitoring operates on four levels. Each answers its own question.

Layer 1: Health of Each Service

Question: Is each individual service running?

How: health check endpoint in each service (/health/live + /health/ready)

What it checks: process is alive, dependencies (database, cache) are reachable, service is ready to accept traffic

Tools: Kubernetes probes (internal), external monitoring (AtomPing HTTP checks with JSON path assertions)
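The liveness/readiness split can be sketched as follows. This is a minimal illustration, not a framework's API: `check_database` and `check_cache` are hypothetical stand-ins for real dependency probes.

```python
# Liveness vs readiness for a single service (sketch).
# check_database / check_cache are hypothetical dependency probes.

def check_database() -> bool:
    # In a real service: run a cheap query such as SELECT 1.
    return True

def check_cache() -> bool:
    # In a real service: PING the cache and verify the reply.
    return True

def health_live() -> tuple[int, dict]:
    # Liveness: the process is up. Never check dependencies here,
    # or a flaky database triggers restart loops.
    return 200, {"status": "alive"}

def health_ready() -> tuple[int, dict]:
    # Readiness: only report ready when dependencies are reachable,
    # so the load balancer stops routing traffic during an outage.
    deps = {"database": check_database(), "cache": check_cache()}
    ok = all(deps.values())
    status = "ready" if ok else "not_ready"
    return (200 if ok else 503), {"status": status, "dependencies": deps}
```

The asymmetry is the point: a failed readiness check removes the pod from rotation, while a failed liveness check restarts it.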

Layer 2: Inter-Service Communication

Question: Can services communicate with each other?

How: metrics at each service boundary — request rate, error rate, latency (p50, p95, p99)

What it catches: network partitions, timeout cascades, serialization errors, circuit breaker trips

Tools: Prometheus + Grafana (internal metrics), Istio/Linkerd service mesh (automatic telemetry), OpenTelemetry SDK (manual instrumentation)
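The percentile arithmetic behind those latency metrics is worth seeing once. In production a metrics library (e.g. Prometheus histograms) does this; the standalone sketch below uses the nearest-rank method on a window of request durations.

```python
# Computing percentile latency from a window of request durations
# (nearest-rank method). Sample data is illustrative.

def percentile(samples: list[float], p: float) -> float:
    """Value below which roughly p% of samples fall (nearest rank)."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

# 100 requests: 95 fast, 5 slow outliers
latencies_ms = [12.0] * 95 + [900.0] * 5

print(percentile(latencies_ms, 50))  # 12.0  — typical request looks fine
print(percentile(latencies_ms, 99))  # 900.0 — the tail exposes the outliers
```

This is why p99 is the first cascade signal: the median stays flat while the tail explodes.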

Layer 3: Infrastructure

Question: Is there enough capacity?

How: CPU, memory, disk, network I/O for each container/pod

What it catches: memory leaks, CPU throttling, disk exhaustion, noisy neighbors

Tools: Kubernetes metrics-server, Prometheus node_exporter, cloud provider metrics (CloudWatch, GCP Monitoring)

Layer 4: End-to-End User Flows

Question: Can users complete their intended actions?

How: synthetic monitoring — external HTTP checks that traverse the entire chain

What it catches: everything that layers 1-3 might miss: DNS issues, TLS problems, CDN failures, load balancer misconfigurations

Tools: AtomPing (HTTP, DNS, SSL, API monitoring with assertions), PageSpeed monitoring

Health Check Patterns for Microservices

Pattern 1: Aggregate Health Endpoint

An API Gateway or BFF (Backend-for-Frontend) service provides a single /health endpoint that checks critical downstream services. Monitoring queries one URL that covers the entire chain.

# API Gateway health check
GET /health/ready

{
  "status": "degraded",
  "services": {
    "user-service": {"status": "healthy", "latency_ms": 12},
    "order-service": {"status": "healthy", "latency_ms": 8},
    "payment-service": {"status": "healthy", "latency_ms": 23},
    "notification-service": {"status": "degraded", "latency_ms": 450}
  }
}

When to use: when you have an API Gateway. It lets you monitor the entire system with one check. Limitation: it doesn't cover services that bypass the Gateway.
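The roll-up logic of such an endpoint can be sketched like this. The service names are illustrative, `probe` stands in for a real HTTP GET with a short timeout against each downstream `/health/ready`, and the overall-status rule (any degraded dependency degrades the aggregate) is one common choice.

```python
# Aggregate /health/ready for an API Gateway (sketch): fan out to
# downstream health endpoints and roll the results up into one response.
import time

def probe(name: str) -> dict:
    # Hypothetical: a real gateway issues an HTTP GET with a short
    # timeout against http://<name>/health/ready.
    start = time.monotonic()
    healthy = name != "notification-service"  # simulated degradation
    latency_ms = round((time.monotonic() - start) * 1000, 1)
    return {"status": "healthy" if healthy else "degraded",
            "latency_ms": latency_ms}

def aggregate_health(services: list[str]) -> dict:
    results = {name: probe(name) for name in services}
    statuses = {r["status"] for r in results.values()}
    overall = "healthy" if statuses == {"healthy"} else "degraded"
    return {"status": overall, "services": results}

report = aggregate_health(
    ["user-service", "order-service", "notification-service"])
print(report["status"])  # degraded — one downstream is degraded
```

Keep the downstream timeouts short: the aggregate endpoint must answer quickly even when a dependency hangs.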

Pattern 2: Per-Service Monitoring

Each service has its own health endpoint, each monitored separately. More monitors, but more precise diagnostics: you immediately see which service failed.

Monitors:

user-service.internal:8080/health/ready → HTTP check + JSON assertion

order-service.internal:8080/health/ready → HTTP check + JSON assertion

payment-service.internal:8080/health/ready → HTTP check + JSON assertion

When to use: when external monitoring has access to internal endpoints (VPN, private network), or when services are exposed via subdomains/paths.

Pattern 3: Critical Path Monitoring

Instead of monitoring each service separately, monitor critical user flows end-to-end. One check traverses 3-5 services in a chain.

Example: "Login flow" check:

POST /api/auth/login → passes through API Gateway → auth-service → user-service → token-service

Assertion: response contains $.access_token

If any service in the chain fails, the check fails. One monitor covers 4 services.

When to use: together with per-service monitoring. Critical path catches end-to-end issues, per-service helps localize problems.
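The assertion side of a critical-path check can be sketched in a few lines. This supports only simple dotted paths like `$.access_token`; the response body is illustrative, and a real check runs from an external monitoring agent, not inside the cluster.

```python
# Minimal JSON path assertion for a synthetic check (sketch):
# verify the login response contains $.access_token.
import json

def assert_json_path(body: str, path: str) -> bool:
    """Supports simple dotted paths like '$.access_token'."""
    doc = json.loads(body)
    for key in path.lstrip("$.").split("."):
        if not isinstance(doc, dict) or key not in doc:
            return False
        doc = doc[key]
    return True

response_body = '{"access_token": "eyJ...", "expires_in": 3600}'
print(assert_json_path(response_body, "$.access_token"))   # True
print(assert_json_path(response_body, "$.refresh_token"))  # False
```

Asserting on the body, not just the status code, matters: a broken auth-service can happily return 200 with an error payload.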

Cascading Failures: Detection and Prevention

Cascading failure is the primary threat in microservices. One slow service can bring down the entire system.

How a Cascade Develops

1. Payment-service starts responding in 10 seconds instead of 100ms (database overloaded)

2. Order-service waits for payment-service response — its threads/goroutines become occupied

3. Order-service stops responding to new requests (thread pool exhaustion)

4. API Gateway gets timeouts from order-service — retries load it further

5. Users see 502/504 errors on checkout page

Monitoring Cascades

Latency p99: the first signal of degradation. If payment-service p99 increases from 100ms to 5s, a cascade is starting.

Error rate spikes: 5xx from payment-service → 5xx from order-service → 5xx at API Gateway. If errors propagate up the chain, it's a cascade.

Circuit breaker state: if a circuit breaker trips, capture it as an event in monitoring.

End-to-end check: AtomPing HTTP check on POST /api/orders with response time threshold — catches end-user impact even if internal metrics are ambiguous.
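The "errors propagate up the chain" signal can be expressed as a simple rule over per-service error rates. The chain, threshold, and single-snapshot simplification are all illustrative; a real detector would compare rates over time windows.

```python
# Cascade heuristic (sketch): a downstream failure alone is an incident,
# but elevated error rates across the whole call chain is a cascade.

CHAIN = ["api-gateway", "order-service", "payment-service"]  # caller -> callee

def is_cascade(error_rates: dict[str, float], threshold: float = 0.05) -> bool:
    """True when every service in the chain exceeds the error threshold."""
    return all(error_rates.get(svc, 0.0) > threshold for svc in CHAIN)

# payment-service failing alone: contained, not (yet) a cascade
print(is_cascade({"payment-service": 0.40}))              # False
# errors visible at every caller: cascade in progress
print(is_cascade({"api-gateway": 0.12,
                  "order-service": 0.30,
                  "payment-service": 0.40}))              # True
```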

Preventing Cascades

Timeouts: every inter-service call must have an explicit timeout (1-5 seconds). No timeout = infinite wait = thread leak.

Circuit breakers: after N consecutive failures to a downstream service, stop attempting for M seconds. Hystrix/Resilience4j/Polly.

Bulkheads: isolate thread pools for different downstream services. A slow payment-service shouldn't exhaust threads meant for user-service.

Retry budget: limit retries to 10-20% of total traffic. If 50% of requests are retries, you're amplifying the overload, not solving it.
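The circuit breaker pattern can be sketched in a few dozen lines, in the spirit of Hystrix/Resilience4j but much simplified: parameter names and the half-open shortcut (allow one trial call after the reset window) are this sketch's own choices.

```python
# Minimal circuit breaker (sketch): after max_failures consecutive
# errors, calls fail fast for reset_after seconds instead of piling
# up on a dying downstream service.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=30.0)

def flaky():
    raise TimeoutError("payment-service timed out")

for _ in range(2):          # two consecutive timeouts trip the breaker
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass

try:
    breaker.call(flaky)     # circuit is now open; flaky() is never called
except RuntimeError as e:
    print(e)                # circuit open: failing fast
```

Note how failing fast protects the caller's thread pool: the third call returns immediately instead of burning a thread on another timeout.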

Distributed Tracing

Distributed tracing lets you follow a single request across all services. A user request gets a trace ID that propagates between services. Each service records a span — its processing time.

Tools: Jaeger, Zipkin, Tempo (Grafana), Datadog APM, New Relic

Standard: OpenTelemetry (W3C Trace Context) — unified SDK for metrics, traces, logs

What it provides: "user request took 3.2s → of which 2.8s waiting for payment-service → payment-service spent 2.7s on a SQL query". Without tracing, you know something is slow, but not why.

Distributed tracing is complementary to external monitoring. Monitoring answers "what broke and when". Tracing answers "why and where exactly". AtomPing detects that the checkout endpoint responds in 5 seconds. Jaeger shows the bottleneck is a SQL query in payment-service.
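The propagation mechanism itself is small enough to show by hand. Per W3C Trace Context, the `traceparent` header carries `version-traceid-spanid-flags`; each hop keeps the trace ID and replaces the span ID with its own. (In practice the OpenTelemetry SDK does this for you; this sketch only illustrates the header format.)

```python
# W3C Trace Context propagation by hand (sketch): the trace ID minted
# at the edge is shared by every hop; each hop mints a new span ID.
import secrets

def new_traceparent() -> str:
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by all spans
    span_id = secrets.token_hex(8)    # 16 hex chars, unique per span
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent: str) -> str:
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

edge = new_traceparent()        # minted at the API Gateway
hop = child_traceparent(edge)   # header forwarded to payment-service
print(edge.split("-")[1] == hop.split("-")[1])  # True: same trace ID
```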

Status Page Mapping

In a microservices architecture, status pages require mapping internal services → public components.

Status page component "API" ← api-gateway, auth-service, rate-limiter

Status page component "Dashboard" ← frontend, user-service, analytics-service

Status page component "Payments" ← payment-service, billing-service, Stripe integration

Status page component "Notifications" ← notification-service, email-service, webhook-service

In AtomPing, each monitor is linked to a status page component. If the check on payment-service/health fails, the "Payments" component automatically transitions to degraded/down, the status page updates, and subscribers receive a notification.
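The mapping and roll-up can be sketched as a plain lookup. Component names follow the examples above; the status rule (all services down → down, some down → degraded) is one reasonable convention, not a fixed product behavior.

```python
# Internal monitors -> public status page components (sketch):
# several service health checks roll up into one user-facing component.

COMPONENTS = {
    "API": ["api-gateway", "auth-service", "rate-limiter"],
    "Payments": ["payment-service", "billing-service", "stripe-integration"],
}

def component_status(component: str, failing_monitors: set[str]) -> str:
    services = COMPONENTS[component]
    down = [s for s in services if s in failing_monitors]
    if not down:
        return "operational"
    return "down" if len(down) == len(services) else "degraded"

print(component_status("Payments", {"payment-service"}))  # degraded
print(component_status("API", set()))                     # operational
```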

Alerting Strategy

With 20+ services, alert fatigue is a real problem. If each service sends alerts independently, an on-call engineer gets 15 notifications at once during a cascade.

Rule 1: Alert on user impact, not service failure. Alerting on "checkout endpoint returns 503" is more important than "payment-service pod restarted". The first is a symptom, the second is one possible cause.

Rule 2: Grouping. One alert "3 services degraded in payment chain" instead of three separate alerts. AtomPing groups incidents by target.

Rule 3: Severity by business impact. P1 — checkout flow down (revenue impact). P3 — analytics-service degraded (no immediate user impact).

Rule 4: Quorum. Don't alert on single failures. AtomPing quorum confirmation (2/3 agents confirm) prevents false alarms from network glitches.
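The quorum rule reduces to one comparison. The 2-of-3 numbers follow the example above; real monitoring agents would of course be independent probe locations, not a list of booleans.

```python
# Quorum confirmation (sketch): only page when a majority of
# independent probe agents agree the target is down, filtering
# out single-agent network glitches.

def confirmed_down(agent_results: list[bool], quorum: int = 2) -> bool:
    """agent_results: True means that agent observed a failure."""
    return sum(agent_results) >= quorum

print(confirmed_down([True, False, False]))  # False: one flaky path, no alert
print(confirmed_down([True, True, False]))   # True: 2/3 agents confirm
```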

Practical Monitoring Plan

For a typical microservices architecture with 10-20 services:

Step 1: Each service has /health/live + /health/ready. Kubernetes probes configured.

Step 2: 3-5 AtomPing HTTP monitors for critical user flows (login, API, checkout). JSON path assertions on response body. 30s interval.

Step 3: AtomPing DNS monitor + SSL monitor for each public domain.

Step 4: Status page with components, map monitors → components.

Step 5: Prometheus/Grafana for internal metrics (CPU, latency, error rates). Alerts via AlertManager.

Step 6: OpenTelemetry tracing across all services for root cause analysis.

Steps 1-4 cover 80% of needs and take 1-2 hours to set up. AtomPing's free tier (50 monitors) is sufficient for most microservices architectures. Steps 5-6 are for mature teams with dedicated SRE.

Related Resources

Health Check Endpoint Design — how to design /health endpoints

Internal vs External Monitoring — why you need both approaches

API Monitoring Guide — monitoring REST API endpoints

Incident Management Guide — detection, response, post-mortem

How to Reduce False Alarms — quorum confirmation in distributed systems

FAQ

Why is monitoring microservices harder than monitoring monoliths?

In a monolith, one health check covers the whole app. In microservices, you have dozens of independent services communicating over the network. Failure modes multiply: service A calls B, which calls C — a timeout in C can cascade through B to A. You need monitoring at every layer: individual service health, inter-service communication, and end-to-end user flows.

What should I monitor in a microservices architecture?

Four layers: (1) Individual service health — health check endpoints for each service, (2) Inter-service communication — latency and error rates between services, (3) Infrastructure — CPU, memory, network for containers/pods, (4) End-to-end user flows — external synthetic checks that verify the complete user journey works.

How do I detect cascading failures in microservices?

Monitor error rates and latency at service boundaries. If Service B's error rate spikes, check if Service A (which calls B) also degrades. Distributed tracing (Jaeger, Zipkin) shows the full request path. External monitoring catches the end-user impact even when internal metrics are ambiguous.

Should each microservice have its own status page component?

Group by user-facing capability, not by internal service. Users don't care about 'user-service' or 'payment-service' — they care about 'Login', 'Checkout', 'API'. Map multiple services to one status page component. If user-service and auth-service both fail, the user sees 'Login: Degraded', not two separate incidents.

What's the role of external monitoring in a microservices setup?

External monitoring verifies the end-to-end system works from the user's perspective. Internal monitoring (Prometheus, Grafana) tells you which service is slow. External monitoring (AtomPing) tells you the user is actually affected. You need both — internal for diagnosis, external for detection.

How many monitors do I need for microservices?

Rule of thumb: one health check per service (N services = N monitors), plus 3-5 synthetic checks for critical user flows (login, checkout, API), plus DNS and SSL monitoring. A typical 20-service architecture needs 25-30 monitors. AtomPing's free plan (50 monitors) covers most microservice architectures.
