In 2013, Amazon lost an estimated $66,240 per minute during a 30-minute outage. In 2024, a CrowdStrike update crashed 8.5 million Windows machines globally. Between these incidents are thousands of less-publicized stories: a SaaS service that lost an enterprise customer due to a 4-hour outage; a startup whose payment API silently broke every night; an e-commerce platform that learned about downtime from complaints on Twitter.
Uptime monitoring is an early warning system. It doesn't prevent downtime (that's the job of reliability engineering), but it reduces the time between "service failed" and "team noticed" from hours to seconds. And that window is the most expensive one: until you know about a problem, you can't start fixing it.
What is uptime monitoring?
Uptime monitoring is continuous, automated checking of your service's availability from external locations. The key word is external. Internal monitoring (Prometheus scraping metrics from pods, Grafana dashboards) is useful, but it runs inside your infrastructure. If the network between your datacenter and users goes down, internal monitoring shows "all green" while users see timeouts.
External monitoring emulates a user: it sends HTTP requests from Frankfurt, Paris, Helsinki—from the same locations where real users connect. If the request fails, you know your users can't access the service either.
How checks work
Every N seconds (30, 60, 300—depending on your configuration), a monitoring agent performs a cycle:
1. DNS Resolution. Resolves the domain name to an IP address. If DNS doesn't respond, the check fails immediately.
2. TCP Connection. Establishes a TCP connection to the server. Measures connection time (TCP handshake).
3. TLS Handshake. For HTTPS endpoints, establishes an encrypted connection. Validates the certificate.
4. HTTP Request. Sends a request (GET, POST, HEAD). Starts the TTFB timer.
5. Response. Receives the response: status code, headers, body. Records TTFB and total response time.
6. Validation. Compares the result against expectations: correct status code? Does the body contain the required keyword? Is response time within threshold?
Each check produces a result: UP (everything okay) or DOWN (something failed) with a reason: timeout, wrong status code, keyword not found, TLS error. Results from all agents are aggregated at the control plane, which decides whether to open an incident.
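The six steps above can be sketched with nothing but Python's standard library. This is illustrative, not a production agent: a real implementation would record per-phase timings, follow redirects, and reuse connections.

```python
import socket
import ssl
import time
from urllib.parse import urlsplit

def run_check(url, timeout=10.0, expected_status=200, keyword=None):
    """One monitoring cycle: DNS -> TCP -> TLS -> HTTP -> validate."""
    parts = urlsplit(url)
    host, scheme = parts.hostname, parts.scheme
    port = parts.port or (443 if scheme == "https" else 80)
    start = time.monotonic()
    try:
        # 1. DNS resolution: fail immediately if the name doesn't resolve
        ip = socket.getaddrinfo(host, port, socket.AF_INET)[0][4][0]
    except socket.gaierror:
        return {"state": "DOWN", "reason": "dns resolution failed"}
    try:
        # 2. TCP connection (the handshake happens inside create_connection)
        sock = socket.create_connection((ip, port), timeout=timeout)
        # 3. TLS handshake for HTTPS, validating the certificate
        if scheme == "https":
            ctx = ssl.create_default_context()
            sock = ctx.wrap_socket(sock, server_hostname=host)
        # 4. HTTP request
        path = parts.path or "/"
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode())
        # 5. Read the full response
        raw = b""
        while chunk := sock.recv(4096):
            raw += chunk
        sock.close()
    except (TimeoutError, ssl.SSLError, OSError) as exc:
        return {"state": "DOWN", "reason": type(exc).__name__}
    total_ms = (time.monotonic() - start) * 1000
    head, _, body = raw.partition(b"\r\n\r\n")
    status = int(head.split(None, 2)[1])
    # 6. Validate against expectations
    if status != expected_status:
        return {"state": "DOWN", "reason": f"status {status}"}
    if keyword and keyword.encode() not in body:
        return {"state": "DOWN", "reason": "keyword not found"}
    return {"state": "UP", "response_ms": round(total_ms, 1)}
```

Each call returns a single UP/DOWN verdict with a reason, which is exactly the shape the control plane aggregates across agents.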
Types of monitoring
"Uptime monitoring" is an umbrella term covering a dozen types of checks, each for a specific scenario. No single type replaces the others; a robust monitoring setup combines several.
HTTP/HTTPS Monitoring
The most common type. Sends an HTTP request to a URL and validates the response. Suitable for websites, APIs, and health check endpoints. HTTP monitoring is the foundation everyone starts with.
What you can check: status code (200, 201, 204...), response time and TTFB, response content (keywords, JSON paths), headers (Content-Type, Cache-Control, CORS), response size, redirect chains.
When to use: for any HTTP/HTTPS endpoint. Homepage, API health checks, login pages, checkout flows—all are HTTP checks. For API monitoring, add JSON path assertions and custom headers (Bearer tokens, API keys).
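The validation half of an HTTP check is a pure function over the response. A sketch below; the rule names (`expect_status`, `expect_keyword`, and so on) are invented for illustration:

```python
def validate_response(status, headers, body, response_ms, rules):
    """Evaluate one HTTP response against a dict of monitoring rules.
    Returns (True, None) on success, or (False, reason) for the first rule that fails."""
    if status != rules.get("expect_status", 200):
        return False, f"unexpected status {status}"
    keyword = rules.get("expect_keyword")
    if keyword and keyword not in body:
        return False, f"keyword {keyword!r} not found"
    for name, expected in rules.get("expect_headers", {}).items():
        if headers.get(name) != expected:
            return False, f"header {name} != {expected!r}"
    limit = rules.get("max_response_ms")
    if limit is not None and response_ms > limit:
        return False, f"slow response: {response_ms}ms > {limit}ms"
    return True, None

# A healthy API response passes every rule
ok, reason = validate_response(
    200,
    {"Content-Type": "application/json"},
    '{"status": "ok"}',
    120,
    {"expect_status": 200, "expect_keyword": '"ok"', "max_response_ms": 500},
)
assert ok
```

Note that the keyword rule is what turns a "200 OK with an error page inside" from a pass into a failure.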
TCP/Port Monitoring
Checks whether a port on your server is open and accepting TCP connections. Doesn't send HTTP requests—just performs a TCP handshake. Port monitoring is used for services that don't operate over HTTP: databases (PostgreSQL:5432, MySQL:3306), mail servers (SMTP:25/587), custom TCP protocols (game servers, IoT gateways).
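A TCP check is essentially a one-line connect. A sketch using the standard library:

```python
import socket

def port_is_open(host, port, timeout=5.0):
    """Attempt only the TCP handshake; no protocol payload is sent.
    Returns True if the connection is accepted, False on refusal or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Keep in mind what this proves: an open port means the process is accepting connections, not that queries against the database actually succeed.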
ICMP/Ping Monitoring
Sends an ICMP echo request (ping) to check if a host is reachable at the network level. Ping monitoring is the lowest-level type: if ping fails, either the host is offline or the network between you and it is down. Used for servers, routers, and network appliances.
Limitation: many cloud providers block ICMP. If ping fails but HTTP works, the issue is the firewall, not the server. Ping complements HTTP monitoring but doesn't replace it.
DNS Monitoring
Checks that your domain's DNS records are correct: A records point to the right IP, MX records are valid for email, TXT records (SPF, DKIM) are in place. DNS monitoring catches problems that HTTP checks detect with a delay—because if DNS doesn't resolve, HTTP requests never begin.
Common scenarios: expired domain, accidentally deleted A record, DNS provider migration with lost records, DNS hijacking. Learn more in the complete DNS monitoring guide.
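A basic A-record check can be sketched with the standard library's resolver. (Dedicated DNS monitors query nameservers directly, for example via dnspython, which also lets them inspect MX and TXT records; this simplified version only covers A records.)

```python
import socket

def a_record_matches(domain, expected_ips):
    """Resolve the domain's IPv4 addresses and confirm at least one
    expected IP is present. Returns (ok, detail)."""
    try:
        infos = socket.getaddrinfo(domain, None, socket.AF_INET)
    except socket.gaierror:
        return False, "resolution failed"
    resolved = sorted({info[4][0] for info in infos})
    if set(resolved) & set(expected_ips):
        return True, resolved
    return False, f"resolved {resolved}, expected one of {expected_ips}"
```

Run on a schedule, the "resolution failed" branch catches an expired domain or deleted record before the first user-facing HTTP failure.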
SSL/TLS Monitoring
Checks SSL/TLS certificate validity: expiration date, certificate chain completeness, hostname match, TLS protocol version. TLS monitoring sends alerts 30 days before expiration—enough time to fix auto-renewal if it breaks.
An expired SSL certificate means complete failure for users, API clients, and mobile apps. It's one of the few types of downtime that is 100% preventable with proper monitoring.
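The certificate's notAfter field is all an expiry check needs. A sketch: one helper fetches the served certificate over a verified TLS connection, and a pure helper computes days remaining (the date format is the one `ssl.getpeercert()` returns):

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """not_after uses the 'Jun  1 12:00:00 2030 GMT' format of ssl.getpeercert()."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def fetch_cert_days_left(host, port=443, timeout=10.0):
    """Connect with full verification and return days left on the served certificate."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"])
```

Alert when the returned value drops below 30 and this class of outage disappears.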
Keyword Monitoring
HTTP check plus response content validation. Keyword checks search for (or confirm the absence of) specific text in the response. They catch silent failures: server returns 200 OK but delivers an empty page, nginx HTML error, or "Database connection failed" buried in JSON.
Heartbeat / Cron Job Monitoring
Inverted monitoring: instead of the system checking your service, your service pings the system. Heartbeat monitoring (also known as the "dead man's switch") is used for cron jobs, scheduled tasks, and background workers—processes that don't have HTTP endpoints to check. If a ping doesn't arrive on time, you get an alert.
Learn more: Cron Job Monitoring Guide.
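In code, a heartbeat is a single outbound request at the very end of a job. A sketch in Python; the URL is hypothetical, and the injectable `opener` exists only to make the helper testable:

```python
import urllib.request

HEARTBEAT_URL = "https://heartbeats.example.com/ping/nightly-backup"  # hypothetical URL

def ping_heartbeat(url=HEARTBEAT_URL, opener=urllib.request.urlopen):
    """Tell the monitor the job finished. Call this only after every step
    succeeded: if the job crashes earlier, no ping is sent and the
    missed-ping alert fires."""
    with opener(url, timeout=10) as response:
        return response.status

def run_backup_steps():
    """Stand-in for the real job; assumed to raise on failure."""
    pass

def nightly_backup():
    run_backup_steps()   # raises on failure, so the ping below is skipped
    ping_heartbeat()     # reached only on success
```

The shell equivalent is appending a single curl call to the end of the cron script.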
Page Speed Monitoring
Loads the page in a headless browser and measures Core Web Vitals: LCP (Largest Contentful Paint), CLS (Cumulative Layout Shift), and INP (Interaction to Next Paint, which replaced First Input Delay as a Core Web Vital in 2024). This is closer to synthetic monitoring of user experience than to traditional uptime checks.
Multi-region monitoring: why one location isn't enough
Monitoring from one location is like a security camera with a single angle. It sees the door is closed but misses the broken window. A single agent in one datacenter is vulnerable to local network problems: BGP routing issues, DNS cache problems, provider maintenance. Any of these can look like "your site is down" even though it works for 99.99% of your users.
Multi-region monitoring solves this fundamentally:
Filtering false positives. One agent sees DOWN, the other 10 see UP? That's a local network problem, not your server. Without multi-region, you'd get a false alert at 3 AM.
Real latency picture. Response time from Frankfurt is 80ms. From Helsinki, 200ms. From Lisbon, 350ms. You see where users experience slowness and can optimize your CDN or add an edge server.
Detecting regional problems. A CDN edge in one region might cache stale data or return errors while other regions work fine. Single-probe monitoring wouldn't catch this—or would only catch it by chance, depending on where the probe happens to be.
Quorum confirmation: consensus instead of retries
The traditional approach to fighting false positives is retries: check three times, alert if all three are DOWN. The problem: retries increase detection latency. With a 60-second interval and 3 retries, you learn about the problem in at least 3 minutes.
Quorum confirmation is a different approach: instead of repeated checks from one location, ask all locations simultaneously. If 8 of 11 agents report DOWN, that's a real incident confirmed in one cycle (30 seconds), not three retries (3 minutes). If only 2 of 11 report DOWN, those two agents have a local problem and the result is suppressed.
Result: false positive rates drop from typical 5–15% to under 0.1%, while detection speed doesn't suffer. Learn more about this mechanism in our guide to reducing false alarms.
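The quorum decision itself is a simple vote count over one cycle. A sketch, with the threshold value chosen for illustration (a majority of 11 agents):

```python
def quorum_state(votes, down_quorum):
    """votes: mapping of agent name -> 'UP' or 'DOWN' for one check cycle.
    The target is DOWN only if at least down_quorum agents agree; a small
    minority of DOWN votes is treated as local agent trouble and suppressed."""
    down_votes = sum(1 for v in votes.values() if v == "DOWN")
    return "DOWN" if down_votes >= down_quorum else "UP"
```

With 8 of 11 agents voting DOWN the incident is confirmed in a single cycle; with 2 of 11, the result is suppressed and no one is woken up.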
Incident detection: from failed check to open incident
Not every failed check is an incident. A single timeout from one region is a transient error. Three consecutive failures from eight regions is a pattern worth investigating. Incident detection is the algorithm that separates noise from signal.
Two-level system: soft and hard incidents
Soft incident. One or more checks return DOWN in a single cycle. The system notes "something happened" and watches more closely. No alert is sent. If the next cycle returns to normal, the soft incident closes quietly—it was just jitter.
Hard incident. N regions confirm DOWN for M consecutive cycles. Default: 3 regions, 3 cycles. This is a confirmed problem. An alert is sent to Slack, Telegram, email, webhook—wherever you've configured.
Recovery. Closing an incident requires R consecutive successful cycles (default: 2). This is hysteresis—protection against flapping where your service oscillates between UP and DOWN, generating 10 "resolved" and 10 "opened" messages in 5 minutes. Recovery cycles prevent this.
All parameters (regions, cycles, recovery) are configured per target via AlertPolicy. Critical payment API: hard_cycles=1, immediate alert. Staging environment: hard_cycles=5, alert only on sustained problems.
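The two-level logic with recovery hysteresis fits in a small state machine. A sketch whose state names and defaults follow the description above; the API itself is illustrative:

```python
class IncidentTracker:
    """OK -> SOFT -> HARD -> OK, with hysteresis on both transitions."""

    def __init__(self, hard_cycles=3, recovery_cycles=2):
        self.hard_cycles = hard_cycles
        self.recovery_cycles = recovery_cycles
        self.state = "OK"
        self.down_streak = 0
        self.up_streak = 0

    def observe(self, is_down):
        """Feed one aggregated cycle result; returns the new state."""
        if is_down:
            self.down_streak += 1
            self.up_streak = 0
            if self.state != "HARD" and self.down_streak >= self.hard_cycles:
                self.state = "HARD"   # confirmed incident: alert fires here
            elif self.state == "OK":
                self.state = "SOFT"   # watch closely, send nothing
        else:
            self.up_streak += 1
            self.down_streak = 0
            if self.state == "SOFT":
                self.state = "OK"     # jitter: close quietly
            elif self.state == "HARD" and self.up_streak >= self.recovery_cycles:
                self.state = "OK"     # recovery confirmed, incident resolved
        return self.state
```

The recovery_cycles requirement is what prevents a flapping service from generating a resolved/opened message pair every cycle.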
Batch anomaly detection
A separate filtering layer. If one agent loses network entirely, all of its checks return DOWN—50, 100, 200 targets at once. Without batch anomaly detection, you'd get 200 alerts. With it, the system detects: "Agent-X reports DOWN for 80% of its targets, other agents report UP → problem is Agent-X, not the targets." All of Agent-X's results are suppressed until recovery.
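The core of the filter is a per-agent ratio test. A sketch (a production version would also cross-check that other agents see those same targets as UP before suppressing):

```python
def suspect_agents(cycle_results, down_ratio_threshold=0.8):
    """cycle_results: {agent: {target: 'UP'/'DOWN'}} for one cycle.
    Returns the agents whose results should be suppressed because an
    implausibly large share of their targets failed at the same moment."""
    flagged = set()
    for agent, checks in cycle_results.items():
        if not checks:
            continue
        down = sum(1 for state in checks.values() if state == "DOWN")
        if down / len(checks) >= down_ratio_threshold:
            flagged.add(agent)
    return flagged
```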
Response time: not just "is it up" but "how fast"
Availability monitoring answers "is it up or down?" Response time monitoring answers "how well is it working?" Both matter: a site with 99.99% uptime but 5-second response times is technically available but functionally useless.
Key response time metrics:
TTFB (Time to First Byte). Time from request to first byte of response. Isolates server performance from network speed. TTFB > 500ms is a signal to investigate your backend.
Total Response Time. Complete time from request to receiving the last byte. For small JSON responses, this is roughly equal to TTFB. For heavy pages, it can differ significantly.
Percentiles (p50, p95, p99). Average response time is misleading—it masks the tail of the distribution. Monitor p95 and p99: they show what the slowest 5% and 1% of users experience.
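How the average hides the tail is easy to see with a skewed sample. A nearest-rank percentile takes a few lines (Python's statistics.quantiles offers interpolated variants if you prefer them):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# 95 fast responses and 5 slow ones, in milliseconds
samples = [100] * 95 + [3000] * 5
mean = sum(samples) / len(samples)  # 245.0 -- looks mildly slow
p50 = percentile(samples, 50)       # 100  -- the typical user is fine
p99 = percentile(samples, 99)       # 3000 -- the slowest 1% wait 3 seconds
```

The mean suggests a modest problem; p99 reveals that some users wait thirty times longer than the median.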
Alerting: drowning in notifications
Monitoring without alerting is a useless dashboard no one reads. Alerting without proper thresholds is a noise generator everyone ignores. The goal is alerts you trust: each notification represents a real problem requiring action.
Notification channels
Not all channels are equally reliable. Email has delivery delays. Slack gets lost in the message stream. SMS is expensive but reliable. Optimal strategy: primary channel (Slack/Telegram) for immediate response + fallback (email/SMS) for critical incidents not acknowledged within 15 minutes.
Escalation
If the on-call engineer doesn't respond within N minutes, escalate the notification: to the next person in rotation, the team lead, the CTO. Without escalation, critical incidents can hang for hours because the alert recipient is asleep with their phone on silent.
Muting and maintenance windows
Before scheduled deployments or maintenance, mute the target. This isn't masking problems—it's hygiene: deployment causes brief downtime (container restarts), and alerts during that time are noise, not signal.
SLA, SLO, SLI: the monitoring connection
Monitoring is the mechanism for collecting SLI (Service Level Indicator) data. SLI is a metric (99.95% availability, p95 latency 200ms). SLO is an internal goal (we aim for 99.95%). SLA is a customer contract (we guarantee 99.9%, or we provide compensation).
Without monitoring, you have no SLI. Without SLI, there's no SLO. Without SLO, there's no SLA. Monitoring is the foundation of the entire reliability pyramid.
The uptime calculator converts uptime percentages to allowed downtime: 99.9% = 43.8 minutes per month. 99.95% = 21.9 minutes. 99.99% = 4.38 minutes.
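The conversion behind those numbers is simple arithmetic over the minutes in an average month:

```python
def allowed_downtime_minutes(uptime_pct, period_minutes=43_800):
    """Downtime budget for a given uptime target.
    43,800 = 365 * 24 * 60 / 12, the average minutes per month
    that uptime calculators typically assume."""
    return period_minutes * (1 - uptime_pct / 100)
```

For a yearly budget, pass period_minutes=525_600 instead.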
Status pages: communicating with users
Monitoring detects the problem. The status page communicates it to users. These are two links in one chain: an incident is automatically created when detected, the status page updates, and users see the current status.
A good status page has: components organized by user workflows, separate infrastructure (available even when the main service is down), update subscriptions, and 90-day incident history. Learn more: Status Page Best Practices and 15 best examples.
The cost of downtime: why this matters
Monitoring is an investment. To justify it, understand the cost of downtime. For SMBs: ~$427/minute (Gartner). For mid-market SaaS: ~$5,600/minute (ITIC). For enterprise e-commerce: $11,000+/minute.
But direct losses are only part of it. Hidden costs include: engineering hours for diagnosis, SLA credits, Google ranking drops (search engines penalize unreliable sites), customer churn from lost trust. The real cost of an hour-long outage is 1.5–2x the direct losses.
Monitoring doesn't prevent downtime. But it reduces MTTR (Mean Time to Resolution). The difference between detecting a problem in 30 seconds versus 2 hours is the difference between a 5-minute incident and a 2.5-hour one.
Setup: step-by-step checklist
A minimal configuration for production services that covers 95% of needs:
1. HTTP monitoring of key endpoints. Homepage, API health checks (/health/ready), login, critical user flows (checkout, dashboard). Interval: 1 minute. Regions: 3+. Add keyword checks to each to catch silent failures (200 OK with empty body).
2. DNS monitoring. Check your primary domain's A record. Check MX records if email is critical. Interval: 5 minutes. Catches DNS-level problems before HTTP failures.
3. SSL monitoring. TLS checks for every HTTPS domain. Alert threshold: 30 days before expiration. Takes 30 seconds to set up, prevents one of the most frustrating types of downtime.
4. Heartbeat for cron jobs. For backups, scheduled syncs, cleanup tasks—add a curl call to a heartbeat URL at the end of each script. If the job doesn't run, you get an alert.
5. Alerting. Connect at least 2 channels: primary (Slack/Telegram for quick response) + fallback (email for completeness). Set hard incident thresholds: 2–3 cycles for confirmation.
6. Status page. Create a public status page with components organized by user workflows. Add a custom domain (status.yourdomain.com). Configure automatic updates on incidents.
7. Response time thresholds. For each HTTP monitor, set warning (2x baseline) and critical (5x baseline) thresholds. Rising response time is an early warning sign of outages.
Advanced techniques
JSON Path Assertions
For API endpoints: validate specific JSON fields, not just the status code. $.status = "ok", $.data.length > 0, $.version contains "2.". This catches cases where the API returns 200 but the data is wrong.
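Full JSONPath needs a library (jsonpath-ng, for example), but the dotted subset in the examples above can be resolved in a few lines. A sketch:

```python
import json

def json_path(document, path):
    """Resolve a simplified dotted path like '$.data.items'.
    Supports only object keys -- a sketch, not full JSONPath."""
    node = document
    for key in path.removeprefix("$.").split("."):
        node = node[key]
    return node

body = json.loads('{"status": "ok", "version": "2.4.1", "data": {"items": [1, 2, 3]}}')
assert json_path(body, "$.status") == "ok"
assert len(json_path(body, "$.data.items")) > 0
assert "2." in json_path(body, "$.version")
```

All three assertions would still pass on a 200 response, and all three would fail loudly if the API started returning the right status code with the wrong data.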
Multi-step transactions
Real service usage is a chain of requests: Login → get token → fetch data → update record. Monitoring individual endpoints might show "all green," but the chain breaks due to issues between steps. Multi-step monitoring checks the entire flow.
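The flow can be modeled as a chain where every step consumes the previous step's output. In the sketch below, `client` is a stand-in for your HTTP layer, and the endpoints and field names are invented for illustration:

```python
def run_transaction(client):
    """client(method, path, token=None) -> (status_code, parsed_json).
    The transaction is DOWN at the first step that breaks the chain."""
    status, body = client("POST", "/login")
    if status != 200 or "token" not in body:
        return ("DOWN", "login")
    token = body["token"]
    status, body = client("GET", "/api/orders", token=token)
    if status != 200 or not body.get("orders"):
        return ("DOWN", "fetch-orders")
    status, _ = client("POST", "/api/orders/refresh", token=token)
    if status != 200:
        return ("DOWN", "refresh")
    return ("UP", None)
```

Because each step would pass in isolation, only a chained check like this catches a token that authenticates but no longer authorizes the follow-up calls.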
Custom headers and authentication
Monitoring protected endpoints requires authorization. Create a dedicated monitoring user with minimal permissions. Use long-lived API keys or service tokens (not personal credentials). Add Bearer tokens or API keys to custom headers in your HTTP checks.
Monitoring behind a CDN
CDNs cache responses and can mask origin server problems. For complete monitoring: check both the cached version (through the CDN) and the origin directly (by IP or bypass header). This way you know if the CDN works but the origin is down—or vice versa.
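Checking the origin directly means connecting to its IP while still sending the public Host header, so the origin routes the request to the right site. A plain-HTTP sketch with the standard library (for HTTPS you would additionally need SNI, i.e. a TLS context wrapped with server_hostname set to the public host):

```python
import http.client

def check_origin(origin_ip, public_host, port=80, path="/", timeout=10.0):
    """Bypass the CDN: connect to the origin's IP directly, but present
    the public Host header. Returns the HTTP status code."""
    conn = http.client.HTTPConnection(origin_ip, port, timeout=timeout)
    try:
        conn.request("GET", path, headers={"Host": public_host})
        return conn.getresponse().status
    finally:
        conn.close()
```

Paired with a normal check through the CDN hostname, this tells you which half of the pair is broken.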
Common mistakes
Monitoring only the homepage. The homepage is often on a CDN and works even when the backend is dead. Monitor API endpoints, login flows, checkout—everything that depends on your code and database.
Single region monitoring. One probe = false alerts from the probe's network problems. Use at least 3 regions for quorum.
Status code only, no body validation. 200 OK with an empty body is a silent failure. Add keyword checks.
Overly aggressive alerting. Alerting on every single timeout → alert fatigue within a week → your team ignores all alerts, including real ones. Use hard incident thresholds.
Forgotten SSL monitoring. "We have Let's Encrypt with auto-renewal" → auto-renewal silently fails → 60 days later your certificate expires → complete outage Saturday night.
No cron job monitoring. Your backup doesn't run for 3 weeks, but you don't know until you need to recover. Heartbeat monitoring is one curl line and can save you from disaster.
FAQ
What is uptime monitoring?
Uptime monitoring is the practice of continuously checking whether your website, API, or online service is accessible to users. External monitoring agents send requests to your endpoints at regular intervals (every 30 seconds to 5 minutes) from multiple geographic locations. If the endpoint doesn't respond correctly, the system creates an incident and sends alerts via email, Slack, Telegram, or other channels.
How does uptime monitoring work?
A monitoring service runs agents in multiple locations (datacenters, cloud regions). Each agent periodically sends an HTTP request (or TCP, ICMP, DNS query) to your endpoint. The agent records the response: status code, response time, TLS validity, response body. Results are sent to a central control plane that evaluates them against your thresholds. If multiple agents confirm a failure, an incident is created and you get notified.
What's the difference between uptime monitoring and APM?
Uptime monitoring checks your service from outside — like a user would. It answers 'is it up and fast?' APM (Application Performance Monitoring) instruments your code from inside — it answers 'which function is slow and why?' You need both: uptime monitoring for instant outage detection, APM for root cause analysis.
How often should I check my website?
For production services: every 30-60 seconds. For staging or internal tools: every 5 minutes. The right frequency depends on your SLA — if you promise 99.9% uptime, a 5-minute check interval means you could miss up to 5 minutes of downtime. At 30-second intervals, you detect issues within 1 minute.
Is uptime monitoring necessary if I use a cloud provider?
Yes. Cloud providers (AWS, GCP, Azure) guarantee infrastructure uptime, not your application uptime. Your code can crash, your database can fill up, your DNS can misconfigure — all while the cloud VM runs perfectly. Uptime monitoring checks your actual service, not the infrastructure underneath it.
What is a good uptime percentage?
99.9% (three nines) is the standard for most production SaaS services — that's about 8.7 hours of allowed downtime per year. 99.95% is common for payment systems and auth services. 99.99% (four nines) is enterprise-grade and requires significant infrastructure investment. Anything below 99.5% usually signals reliability problems.
How many monitoring regions do I need?
Minimum 3 for basic quorum (2-of-3 agreement prevents false positives). 5-7 for solid coverage of a single continent. 10+ for global services or when you need high-confidence incident detection. More regions = fewer false alerts and faster, more accurate detection.
Can uptime monitoring detect slow performance, not just outages?
Yes. Modern uptime monitors track response time (TTFB and total) alongside availability. You can set thresholds: alert if response time exceeds 500ms for 3 consecutive checks. This catches performance degradation before it becomes a full outage — the 'site is slow' stage before the 'site is down' stage.