Intelligent Incident Detection
Because not every blip deserves an alert
Soft vs Hard Incidents
Soft = 1-2 regions fail (potential issue). Hard = 3+ regions fail (confirmed outage). Alert only on hard incidents to cut noise.
Recovery Confirmation
Don't send an 'up' alert after one successful check. Require 2-3 consecutive successes before marking recovered.
Multi-Region Validation
Confirm incidents from multiple locations before alerting. Network blips don't trigger false alarms.
Maintenance Windows
Schedule downtime for deployments. No alerts during maintenance -- alerting resumes automatically after.
Incident Timeline
Complete history: when the incident started, which regions failed, when it recovered, and total downtime.
Alert Suppression
Temporarily silence alerts for specific monitors. No notifications until you re-enable.
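As a rough illustration of the incident timeline described above, the sketch below shows one way such a record could be stored and how total downtime falls out of it. The class and field names (IncidentRecord, started_at, failed_regions, recovered_at) are hypothetical and not taken from the product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class IncidentRecord:
    """Hypothetical timeline entry for one incident."""
    started_at: datetime
    failed_regions: List[str] = field(default_factory=list)
    recovered_at: Optional[datetime] = None  # None while the incident is open

    def total_downtime(self, now: Optional[datetime] = None) -> timedelta:
        # For an open incident, measure up to `now` (defaults to current UTC time).
        end = self.recovered_at or now or datetime.now(timezone.utc)
        return end - self.started_at


incident = IncidentRecord(
    started_at=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
    failed_regions=["us-east", "eu-west", "ap-south"],  # example region names
    recovered_at=datetime(2024, 1, 1, 3, 12, 30, tzinfo=timezone.utc),
)
print(incident.total_downtime())  # 0:12:30
```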
How Smart Incident Detection Works
Advanced logic prevents alert fatigue
Multi-Check Confirmation
The first failed check doesn't trigger an alert. The system waits for 2-3 consecutive failures to confirm a real incident.
Regional Verification
For multi-region monitors, require 3+ regions to fail before alerting. Single-region issues don't wake you up.
Smart Recovery
After incident, require 2-3 consecutive successes before marking 'up'. Flapping services don't spam recovery alerts.
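The confirmation and recovery logic above can be pictured as a small state machine. The following is a minimal sketch, not the product's actual implementation; the class name IncidentTracker and its parameters are made up for illustration, and the thresholds of 2 are assumed values from the 2-3 range mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentTracker:
    """Hypothetical per-monitor tracker with confirmation thresholds."""
    failures_to_confirm: int = 2   # assumed value (page says 2-3)
    successes_to_recover: int = 2  # assumed value (page says 2-3)
    is_down: bool = False
    streak: int = 0  # consecutive checks that contradict the current state

    def record(self, check_ok: bool) -> Optional[str]:
        """Feed one check result; return 'down' or 'up' on a state flip, else None."""
        contradicts = check_ok if self.is_down else not check_ok
        if not contradicts:
            self.streak = 0  # a blip or flap resets the count
            return None
        self.streak += 1
        needed = self.successes_to_recover if self.is_down else self.failures_to_confirm
        if self.streak < needed:
            return None
        self.is_down = not self.is_down
        self.streak = 0
        return "down" if self.is_down else "up"


tracker = IncidentTracker()
results = [False, True, False, False, True, True]
print([tracker.record(ok) for ok in results])
# [None, None, None, 'down', None, 'up']
```

The single failure at the start never alerts because the success after it resets the streak. Only the second consecutive failure flips the state, and recovery needs two successes in a row.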
Who Benefits from Smart Incident Management?
Stop alert fatigue, start sleeping better
On-Call Teams
Eliminate 3AM false alarms. Only get woken for real incidents that need immediate attention.
Enterprise Operations
Manage hundreds of monitors without alert overload. Maintenance windows prevent deployment noise.
Small Teams
Can't afford 24/7 on-call? Smart alerts ensure you're only notified for real issues.
SRE Teams
Reduce alert fatigue with intelligent thresholds. Focus on real problems, not false positives.
Incident Management FAQ
Common questions about smart alerts
What's a soft incident vs hard incident?
Soft = down in 1-2 regions (might be a network blip). Hard = down in 3+ regions, or failing consecutive checks (confirmed outage). You choose what triggers alerts.
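To make the region-count part of that distinction concrete, here is a small sketch. It covers only the "how many regions failed" rule, not consecutive failures, and the function names and region names are hypothetical.

```python
from typing import Mapping


def classify_incident(region_ok: Mapping[str, bool], hard_threshold: int = 3) -> str:
    """Return 'ok', 'soft', or 'hard' from the latest check result per region."""
    failed = sum(1 for ok in region_ok.values() if not ok)
    if failed == 0:
        return "ok"
    return "hard" if failed >= hard_threshold else "soft"


def should_alert(region_ok: Mapping[str, bool], hard_threshold: int = 3) -> bool:
    # Assumes alerts fire on hard incidents only.
    return classify_incident(region_ok, hard_threshold) == "hard"


latest = {"us-east": False, "eu-west": False, "ap-south": True, "sa-east": True}
print(classify_incident(latest), should_alert(latest))  # soft False
```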
How do recovery cycles work?
After an incident, the system requires 2-3 consecutive successful checks before marking it 'recovered'. This prevents flapping alerts when services bounce up and down.
What are maintenance windows?
Scheduled periods where monitoring continues but alerts are suppressed. Perfect for deployments, maintenance, or planned downtime.
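One way to picture this: checks keep running, and the notification step consults the schedule first. The sketch below assumes a hypothetical MaintenanceWindow type and should_notify function; neither is taken from the product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class MaintenanceWindow:
    """Hypothetical scheduled window; start inclusive, end exclusive."""
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


def should_notify(incident_is_hard: bool, now: datetime,
                  windows: List[MaintenanceWindow]) -> bool:
    """Suppress notifications inside any window; otherwise alert on hard incidents."""
    if any(w.covers(now) for w in windows):
        return False
    return incident_is_hard


deploy = MaintenanceWindow(
    start=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
)
during = datetime(2024, 1, 1, 2, 30, tzinfo=timezone.utc)
after = datetime(2024, 1, 1, 3, 5, tzinfo=timezone.utc)
print(should_notify(True, during, [deploy]), should_notify(True, after, [deploy]))
# False True
```

Because the window is just a time range checked at notification time, alerting picks up again on its own once the end time passes.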
Can I temporarily silence specific monitors?
Yes! Mute/unmute any monitor anytime. Useful when you know something's wrong and are actively fixing it.
How many false positives should I expect?
With default settings (3+ regions, 2 consecutive failures): near zero. Network blips and single-location issues are filtered out automatically.
Can I customize incident detection thresholds?
Yes! Configure per-monitor: number of regions required, consecutive failures needed, recovery cycles required. Tune for your needs.
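For illustration only, per-monitor thresholds could be modeled as below. The DetectionThresholds type, its field names, and the monitor name "checkout-api" are all hypothetical. The region and failure defaults mirror the ones quoted in this FAQ, and the recovery default of 2 is an assumed value from the 2-3 range.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DetectionThresholds:
    """Hypothetical per-monitor settings."""
    regions_required: int = 3      # regions that must fail before alerting
    consecutive_failures: int = 2  # failed checks in a row to confirm
    recovery_cycles: int = 2       # successful checks in a row to recover (assumed)

    def __post_init__(self) -> None:
        for name in ("regions_required", "consecutive_failures", "recovery_cycles"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")


DEFAULTS = DetectionThresholds()

# Example override: a stricter, slower-to-recover profile for one monitor.
OVERRIDES: Dict[str, DetectionThresholds] = {
    "checkout-api": DetectionThresholds(
        regions_required=2, consecutive_failures=3, recovery_cycles=3
    ),
}


def thresholds_for(monitor_name: str) -> DetectionThresholds:
    """Return the monitor's own thresholds, falling back to the defaults."""
    return OVERRIDES.get(monitor_name, DEFAULTS)


print(thresholds_for("checkout-api").regions_required)  # 2
print(thresholds_for("marketing-site").regions_required)  # 3
```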
What happens during a partial outage (some regions down)?
You'll see a degraded status. Whether it triggers alerts depends on your threshold (e.g., 'alert if 3+ regions down').