Prometheus shows CPU at 20%, memory at 40%, all pods in Running state. Grafana dashboards are green. But users write to support: "the site won't open".
The reason: your DNS provider went down. Or Cloudflare is caching an old certificate. Or an ISP in Germany has a routing problem. Internal monitoring can't see this — it looks from the inside. You need an outside perspective.
Internal Monitoring: Inside View
Internal monitoring consists of agents and exporters running inside your infrastructure. They collect metrics about how your systems are performing.
What it sees:
— CPU, memory, disk, network I/O for each server/container
— Application metrics: request rate, error rate, latency (RED metrics)
— Database performance: query time, connection pool, replication lag
— Queue depths, worker throughput, job failure rates
— Application logs, traces, profiling data
What it cannot see:
— DNS resolution failures for your users
— SSL/TLS issues (expired cert, chain issues, mixed content)
— CDN outages and cache poisoning
— Load balancer misconfigurations
— ISP routing problems and BGP hijacks
— Firewall rules blocking legitimate traffic
Tools: Prometheus + Grafana, Datadog, New Relic APM, CloudWatch, node_exporter, cAdvisor, OpenTelemetry.
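To make the RED metrics above concrete, here is a minimal in-process tracker sketch. It is illustrative only: in production you would use a client library such as prometheus_client or an OpenTelemetry SDK rather than hand-rolling this, and the endpoint name and thresholds are made up for the example.

```python
from collections import defaultdict

class REDMetrics:
    """Minimal RED (Rate, Errors, Duration) tracker, for illustration only.

    Real services should use prometheus_client or OpenTelemetry instead.
    """

    def __init__(self):
        self.requests = defaultdict(int)    # request count per endpoint
        self.errors = defaultdict(int)      # 5xx count per endpoint
        self.durations = defaultdict(list)  # latency samples per endpoint

    def observe(self, endpoint, status, seconds):
        """Record one request: its endpoint, HTTP status, and duration."""
        self.requests[endpoint] += 1
        if status >= 500:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(seconds)

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

    def p95_latency(self, endpoint):
        """Nearest-rank p95 over all recorded samples."""
        samples = sorted(self.durations[endpoint])
        if not samples:
            return 0.0
        return samples[int(0.95 * (len(samples) - 1))]

metrics = REDMetrics()
metrics.observe("/api/checkout", 200, 0.120)  # hypothetical endpoint
metrics.observe("/api/checkout", 503, 0.950)
print(metrics.error_rate("/api/checkout"))  # 0.5
```

Note that this only ever sees traffic that reaches the process — which is exactly the blind spot the next section is about.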
External Monitoring: Outside View
External monitoring uses agents in different geographic locations that do what users do: open URLs, send requests, verify responses.
What it sees:
— Service availability from different regions (synthetic checks)
— Response time from the user's perspective (including DNS, TLS, network latency)
— Response correctness (keyword checks, JSON path assertions)
— DNS resolution, SSL certificate validity, TCP connectivity
— The entire delivery chain: DNS → CDN → Load Balancer → App → DB → Response
What it cannot see:
— Internal metrics (CPU, memory, disk — only indirectly through response time)
— Root cause: is the 503 caused by a down database, a crashed container, or a deployment error?
— Internal services not exposed externally
Tools: AtomPing (9 check types, quorum confirmation), Pingdom, UptimeRobot, Better Stack, Checkly.
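The "quorum confirmation" mentioned above is worth unpacking: alert only when enough probe locations agree the target is down, so a single region's network blip doesn't page anyone. Here is a rough sketch of the idea in Python — a simplified illustration, not AtomPing's actual implementation, and the region names are invented.

```python
def quorum_is_down(probe_results, quorum=0.5):
    """Return True if more than `quorum` of probe locations report failure.

    probe_results: dict mapping location name -> bool (True = check passed).
    Requiring agreement across locations filters out single-region blips.
    """
    if not probe_results:
        return False
    failures = sum(1 for ok in probe_results.values() if not ok)
    return failures / len(probe_results) > quorum

# One failing region is likely a local network blip, not an outage:
print(quorum_is_down({"us-east": True, "eu-west": False, "ap-south": True}))   # False
# A majority of regions failing suggests a real outage:
print(quorum_is_down({"us-east": False, "eu-west": False, "ap-south": True}))  # True
```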
Blind Spots of Each Approach
Internal Monitoring Only: "Everything is Green, But the Site is Down"
Scenario 1: Your DNS provider updated NS records with an error. Your server works, but the domain doesn't resolve. Prometheus shows near-zero CPU usage (because no requests reach the server). Everything looks "fine".
Scenario 2: Let's Encrypt didn't renew the certificate (the renewal webhook failed). The app runs, but browsers show "Not Secure". Internal monitoring doesn't check the TLS chain.
Scenario 3: CDN (Cloudflare/CloudFront) went down in the EU region. US users work fine, EU users see 502. Your server is healthy.
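The expired-certificate scenario is easy to catch from the outside. As a sketch of what an external TLS check does, the stdlib is enough: connect like a browser would and read the certificate's expiry. This is an illustration under the assumption of a plain TLS endpoint on port 443, not a replacement for a monitoring service.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """Parse the notAfter field as returned by ssl.getpeercert(),
    e.g. 'Jun 30 12:00:00 2030 GMT', and return days remaining."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400

def check_certificate(hostname, port=443):
    """Connect from outside, like an external monitor would, and
    return days until the served certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"])
```

An internal agent could run the same code, but it would typically hit the app server directly and miss a stale certificate cached at the CDN or load balancer in front of it.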
External Monitoring Only: "The Site is Down, But Why?"
Scenario 1: AtomPing sees HTTP 503 on /api/checkout. But what causes 503? Database connection exhausted? OOMKilled container? Rate limited by Stripe? Without internal metrics, you're diagnosing blind.
Scenario 2: Response time increased from 200ms to 3s. The external monitor alerts. But is it a memory leak with growing GC pauses? A slow query after a migration? A noisy neighbor on shared hosting? You need a Grafana dashboard.
How They Work Together
Detection → External. AtomPing discovers: checkout API returned 503. Alert goes to Slack/Telegram/PagerDuty. On-call engineer is notified.
Diagnosis → Internal. Engineer opens Grafana. Sees: payment-service pod OOMKilled 2 minutes ago. Memory usage was climbing for 30 minutes (memory leak). Kubernetes restarted the pod, but the new one will crash soon.
Fix → Based on diagnosis. Engineer finds memory leak in the new release, rolls back deployment. Grafana shows memory stabilized. AtomPing confirms: checkout API responds 200 again.
Communication → External-driven. Status page automatically updates from AtomPing. Users see: "Payments: Resolved".
What to Monitor with Each Approach
| Metric | External | Internal |
|---|---|---|
| Is the site available? | Primary | — |
| Response time for users | Primary | Complementary |
| SSL/TLS validity | Primary | — |
| DNS resolution | Primary | — |
| CPU / Memory / Disk | — | Primary |
| Database performance | Indirect | Primary |
| Application errors & logs | — | Primary |
| Queue depths | — | Primary |
| Content correctness | Primary | — |
Practical Plan
Day 1 (5 minutes): AtomPing — HTTP checks on key endpoints, DNS monitor, SSL monitor. Free tier: 50 monitors.
Day 2 (30 minutes): Health check endpoints in each service. AtomPing monitors for each /health/ready.
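A /health/ready endpoint is the bridge between the two worlds: internal dependency checks, exposed so an external monitor can poll them. Here is a minimal sketch of the aggregation logic; the dependency names are hypothetical, and a real service would ping its actual database, cache, and queue.

```python
import json

def readiness(checks):
    """Aggregate dependency checks into a readiness response.

    checks: dict mapping dependency name -> zero-arg callable that
    returns True when healthy (and may raise on failure).
    Returns (http_status, body) suitable for a /health/ready endpoint.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as unhealthy
    status = 200 if all(results.values()) else 503
    return status, json.dumps({"ready": status == 200, "checks": results})

# Hypothetical dependencies for illustration:
status, body = readiness({"database": lambda: True, "cache": lambda: False})
print(status)  # 503
```

Returning the per-dependency breakdown in the body means the external alert already hints at the root cause before anyone opens a dashboard.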
Week 1: Prometheus + Grafana for infrastructure metrics. node_exporter on servers.
Week 2: Application metrics (request rate, error rate, latency) via OpenTelemetry or Prometheus client.
Month 1: Distributed tracing (Jaeger/Tempo), custom Grafana dashboards, per-service alerts.
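For the Week 2 step, it helps to see what Prometheus actually scrapes: a plain-text /metrics page. The sketch below renders counters in that text format by hand, purely to demystify it — in practice prometheus_client generates this output for you, and the metric names here are conventional examples, not required names.

```python
def render_prometheus(metrics):
    """Render counters in the Prometheus text exposition format,
    the plain-text output a /metrics endpoint serves for scraping.

    metrics: list of (name, labels_dict, value) tuples.
    """
    lines = []
    for name, labels, value in metrics:
        # Labels are rendered as key="value" pairs inside braces.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

output = render_prometheus([
    ("http_requests_total", {"method": "GET", "status": "200"}, 1027),
    ("http_requests_total", {"method": "GET", "status": "503"}, 3),
])
print(output)
```

From counters like these, Prometheus derives the request rate and error rate that your per-service alerts fire on.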
External monitoring is the first line of defense. Internal monitoring is the second. Together they provide a complete picture: what broke (external), why (internal), and how fast you fixed it (both).
Related Resources
Complete Guide to Uptime Monitoring — external monitoring from A to Z
Monitoring Microservices — 4 layers of monitoring for distributed systems
Health Check Endpoint Design — bridge between internal and external monitoring
Incident Management Guide — from detection (external) to diagnosis (internal) and resolution
FAQ
What is internal monitoring?
Internal monitoring collects metrics from inside your infrastructure — CPU usage, memory, disk I/O, application logs, database query performance, queue depths. Tools like Prometheus, Grafana, Datadog agents, and CloudWatch run inside your network and observe system internals.
What is external monitoring?
External monitoring checks your services from outside your infrastructure — the same perspective as your users. It sends HTTP requests, DNS queries, TCP connections, and ICMP pings from distributed locations to verify your service is reachable, fast, and returning correct content. AtomPing is an external monitoring tool.
Can I use only internal monitoring?
No. Internal monitoring has a blind spot: it can't detect problems between your infrastructure and your users. DNS resolution failures, CDN outages, TLS certificate issues, ISP routing problems, and load balancer misconfigurations are invisible to internal tools but immediately caught by external monitoring.
Can I use only external monitoring?
For basic needs — yes, external monitoring covers the most critical question: can users reach my service? But when something breaks, external monitoring tells you WHAT failed, not WHY. Internal monitoring provides the diagnostic detail: which server is overloaded, which query is slow, which container is leaking memory.
How do internal and external monitoring work together?
External monitoring detects the problem (checkout API returns 503). Internal monitoring diagnoses the cause (payment-service pod OOMKilled, database connection pool exhausted). Use external monitoring for alerting (wake someone up) and internal monitoring for troubleshooting (find the root cause).
Which should I set up first?
External monitoring. It answers the most important question — are my users affected? — and takes 5 minutes to set up. Internal monitoring requires agents, dashboards, and configuration. Start with AtomPing (external) for immediate coverage, add Prometheus/Grafana (internal) as your team and infrastructure grow.