On-call is essential for any team running production systems. But poorly organized on-call becomes a burnout factory: false-alarm pages at night, missing runbooks, unclear escalation policies, and the feeling of being alone with production.
Well-organized on-call is predictable, fair, and tool-supported. Below are the practices that transform on-call from a nightmare into a manageable process.
Scheduling: On-Call Structure
Basic Rotation
Primary on-call: first responder. Receives all alerts. Must acknowledge within 5 minutes.
Secondary on-call: backup. Receives the alert if the primary doesn't acknowledge within the timeout (typically 5-15 minutes, set by your escalation policy). Also available for escalation of complex incidents.
Duration: 1 week (Monday 10:00 → next Monday 10:00). Handoff during business hours, not weekends.
Rotation direction: round-robin, predictable. Schedule 2-3 months ahead. Allow swaps between participants.
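A round-robin weekly rotation is simple enough to generate programmatically. A minimal sketch (function and engineer names are illustrative, not a specific scheduling tool's API):

```python
from datetime import date, timedelta

def build_rotation(engineers, start_monday, weeks):
    """Round-robin weekly schedule: each entry is (week_start, primary, secondary).

    The secondary is the next engineer in the rotation, so everyone
    serves as backup the week before their primary shift.
    """
    n = len(engineers)
    schedule = []
    for w in range(weeks):
        week_start = start_monday + timedelta(weeks=w)
        primary = engineers[w % n]
        secondary = engineers[(w + 1) % n]
        schedule.append((week_start, primary, secondary))
    return schedule

# Publish ~3 months ahead so people can plan swaps.
rotation = build_rotation(["alice", "bob", "carol", "dave"], date(2025, 1, 6), 12)
for week_start, primary, secondary in rotation[:2]:
    print(week_start, primary, secondary)
```

With 4 engineers, each person is primary every 4th week and secondary the week before, which keeps the load predictable.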
Follow-the-Sun
For distributed teams (EU + US + Asia): on-call rotates by timezone. EU engineer is on-call 08:00-16:00 CET, US engineer 08:00-16:00 EST, Asia covers night hours. Nobody wakes up at 3 AM.
Requirement: at least 2 people per timezone for backup.
Challenge: handoff between timezones requires a clear process (status update, open incidents, context transfer).
Alert Fatigue: The On-Call Enemy
Alert fatigue is when an on-call engineer stops responding to alerts because most are false or irrelevant. In practice, after 3-5 false alarms in a single night, engineers start treating alerts as noise. Miss one real incident in that noise and you have a P1 with no response.
Sources of Alert Fatigue
False positives: monitoring fires due to a network glitch, not a real outage. Solution: quorum confirmation in AtomPing—2 out of 3 agents confirm the problem before alerting.
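The quorum idea itself is simple; a generic sketch of the decision (not AtomPing's actual implementation, just the principle):

```python
def quorum_confirmed(agent_results, quorum=2):
    """Alert only if at least `quorum` independent agents report failure.

    agent_results: mapping of agent name -> bool (True = check failed).
    A single failing agent is treated as a possible network glitch.
    """
    failures = sum(1 for failed in agent_results.values() if failed)
    return failures >= quorum

# One agent hit a transient glitch: no page.
print(quorum_confirmed({"eu": True, "us": False, "asia": False}))  # False
# Two of three agents agree the target is down: page.
print(quorum_confirmed({"eu": True, "us": True, "asia": False}))   # True
```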
Non-actionable alerts: "disk usage 81%" at 3 AM. This can wait until morning. Solution: severity-based routing—P1/P2 page, P3/P4 go to Slack.
Alert storms: cascading failure creates 20 alerts at once. Solution: grouping and deduplication. One incident = one alert, even if 10 monitors failed.
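Deduplication boils down to collapsing alerts that share a root-cause key. A minimal sketch, assuming each alert carries a field (here `service`) that ties it to the affected system:

```python
def deduplicate(alerts):
    """Collapse an alert storm into one notification per incident key.

    Ten failing monitors on one service become one page that lists
    the affected monitors, instead of ten separate pages.
    """
    incidents = {}
    for alert in alerts:
        key = alert["service"]
        incidents.setdefault(key, {"service": key, "monitors": []})
        incidents[key]["monitors"].append(alert["monitor"])
    return list(incidents.values())

storm = [
    {"service": "payments", "monitor": "api-latency"},
    {"service": "payments", "monitor": "checkout-503"},
    {"service": "payments", "monitor": "db-connections"},
]
pages = deduplicate(storm)
print(len(pages))  # 1 page instead of 3
```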
Flapping: monitor oscillates UP↔DOWN every 2 minutes. Solution: hysteresis—require N consecutive failures before alert and M consecutive successes before recovery.
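Hysteresis is a small state machine. A sketch of the N-failures / M-successes rule (thresholds are illustrative defaults):

```python
class HysteresisMonitor:
    """Require N consecutive failures before DOWN and M consecutive
    successes before UP, suppressing flapping."""

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.state = "UP"
        self._streak = 0  # consecutive results contradicting current state

    def observe(self, check_ok):
        if check_ok == (self.state == "UP"):
            # Result agrees with current state: reset the streak.
            self._streak = 0
            return self.state
        self._streak += 1
        if self.state == "UP" and self._streak >= self.fail_threshold:
            self.state, self._streak = "DOWN", 0
        elif self.state == "DOWN" and self._streak >= self.recover_threshold:
            self.state, self._streak = "UP", 0
        return self.state

m = HysteresisMonitor(fail_threshold=3, recover_threshold=2)
# A lone failure and a lone recovery don't flip the state.
results = [m.observe(ok) for ok in [False, True, False, False, False, True, True]]
print(results)
```

Note how the single failure at the start never reaches the alerting layer: the state only flips after three failures in a row.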
Metric: Alerts per On-Call Shift
Healthy level: 0-2 pages per week (out-of-hours). 5-10 is tolerable. 10+ is a problem to solve.
Track it: after each shift, brief report: total alerts, how many were actionable, how many were false. Trend monthly.
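The per-shift report is easy to automate if each alert is tagged when it's resolved. A sketch, assuming two hypothetical boolean tags per alert:

```python
def shift_report(alerts):
    """Summarize a shift: total, actionable, false, and night-page counts.

    alerts: list of dicts with 'actionable' and 'out_of_hours' booleans,
    filled in by the on-call engineer when closing each alert.
    """
    total = len(alerts)
    actionable = sum(a["actionable"] for a in alerts)
    return {
        "total": total,
        "actionable": actionable,
        "false": total - actionable,
        "night_pages": sum(a["out_of_hours"] for a in alerts),
    }

week = [
    {"actionable": True,  "out_of_hours": True},
    {"actionable": False, "out_of_hours": True},
    {"actionable": True,  "out_of_hours": False},
]
print(shift_report(week))
```

Feed each weekly report into a spreadsheet or dashboard and the monthly trend falls out for free.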
Escalation Policies
An escalation policy defines: who gets the alert, in what order, and after how many minutes it escalates.
Level 1 (0 min): Primary on-call → Slack + push notification. Acknowledge timeout: 5 minutes.
Level 2 (5 min): Secondary on-call → Slack + push + SMS. Acknowledge timeout: 10 minutes.
Level 3 (15 min): Engineering manager → phone call. Acknowledge timeout: 10 minutes.
Level 4 (25 min): CTO / VP Engineering → phone call. This is P1 and nobody's answered in 25 minutes.
Severity determines which escalation level to start at. P1 (revenue impact) starts at level 1, immediate. P4 (cosmetic) goes to Slack only, no paging.
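An escalation policy is just data: a list of (delay, target, channels) rows. A sketch of the four levels above as code (target names are placeholders for your own roster):

```python
ESCALATION_POLICY = [
    # (starts_at_minute, target, channels)
    (0,  "primary-oncall",   ["slack", "push"]),
    (5,  "secondary-oncall", ["slack", "push", "sms"]),
    (15, "eng-manager",      ["phone"]),
    (25, "cto",              ["phone"]),
]

def active_targets(minutes_since_alert):
    """Everyone who should have been notified by now, assuming the
    alert is still unacknowledged."""
    return [target for starts_at, target, _ in ESCALATION_POLICY
            if minutes_since_alert >= starts_at]

print(active_targets(7))  # primary and secondary
```

Tools like PagerDuty or OpsGenie express the same table in their UI; the point is that each level has an explicit start time and channel set, so nobody has to improvise at 3 AM.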
Severity-Based Routing
P1—Critical: revenue loss, data loss, full outage. → Phone + SMS + push. Wake them up. Example: checkout API 503, database unreachable.
P2—High: significant degradation, partial outage. → Push + Slack DM. Example: API latency 10x normal, one region down.
P3—Medium: minor degradation, workaround exists. → Slack channel only. Example: email service slow, analytics delayed.
P4—Low: cosmetic, informational. → Slack channel, review next business day. Example: disk usage 80%, cert expires in 20 days.
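The routing table above, expressed as a sketch (channel names are illustrative):

```python
ROUTING = {
    "P1": ["phone", "sms", "push"],      # wake them up
    "P2": ["push", "slack_dm"],          # interrupt, but gently
    "P3": ["slack_channel"],             # visible, not urgent
    "P4": ["slack_channel"],             # next business day
}

def should_page(severity):
    """Only P1/P2 interrupt a human; P3/P4 wait in Slack."""
    return any(ch in ("phone", "sms", "push") for ch in ROUTING[severity])

print(should_page("P1"), should_page("P4"))  # True False
```

Keeping this mapping in one place makes it auditable: when someone gets paged for a P3, you fix the table, not the person.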
Runbooks: Document Your Response
A runbook is a step-by-step guide for an on-call engineer: what to do when a specific alert fires. Without a runbook, an engineer (especially a junior or someone from another team) spends 20 minutes understanding context. With a runbook—2 minutes.
Runbook structure:
Alert name: "Checkout API 503"
What it means: Checkout endpoint returns 503. Users can't pay.
Severity: P1 — revenue impact
First check: 1) Open Grafana dashboard "Payment Service" 2) Check pod status: kubectl get pods -n payments
Common causes: Database connection exhaustion → restart pods. OOMKilled → check memory leak, rollback last deployment. Stripe API down → check status.stripe.com
Escalate if: Cause unclear after 15 minutes → escalate to payment team lead
Rule: every alert that wakes someone up at night must have a runbook. If an alert doesn't have one, either write it or delete the alert; if no runbook can be written, the alert isn't actionable and shouldn't page anyone.
Handoff: Shift Transition
When: fixed time during business hours (e.g., Monday 10:00). Not Friday evening.
Format: 5-10 minute sync (Slack thread or brief call):
1. Open incidents (if any)—status, context, next steps
2. Alerts last week—what fired, were they actionable, do thresholds need tuning
3. Planned changes—deployments, migrations, infrastructure work
4. Known issues—"Redis sometimes slows after 2 AM backup, it's normal, don't page"
Preventing Burnout
Compensation
On-call restricts personal time. Engineers can't go hiking, have wine at dinner, or watch a movie without their phone. This deserves compensation.
Models:
— Fixed stipend: $200-500 per on-call week (region and frequency dependent)
— Hourly pay for time spent on incidents
— Extra PTO: 0.5-1 day off after each on-call week
— Combination: fixed + extra PTO after heavy weeks (3+ night pages)
Invest in Silence
Every false positive is on-call tech debt. Allocate 10-20% sprint capacity to "on-call improvements": tune alerts, write runbooks, fix flaky tests, eliminate root causes of repeated incidents.
Process: after each shift, review. Which alerts fired? Were they actionable? If not, create a task to fix/tune/remove. Track "alerts per shift" as a metric. Goal: zero false night pages.
Tools: AtomPing with quorum confirmation and batch anomaly detection significantly reduce false positives. Hysteresis (N consecutive failures before alert) prevents flapping.
On-Call Toolchain
Detection: AtomPing — external monitoring with 30s intervals, 9 check types, quorum confirmation. Minimal false alarms.
Alerting: AtomPing → Slack, Telegram, Discord, email, webhooks. For escalation policies: PagerDuty or OpsGenie (compare alternatives).
Diagnosis: Grafana dashboards (internal metrics), application logs (ELK/Loki).
Communication: Status page (AtomPing) for public updates. Slack #incident channel for internal.
Post-incident: blameless post-mortem after every P1/P2. Action items → sprint backlog.
Healthy On-Call Checklist
Rotation: ≥4 people, weekly shifts, primary + secondary
Alerts: severity-based routing, P1/P2 page, P3/P4 Slack only
False positives: quorum confirmation, hysteresis, less than 2 false alarms per week
Runbooks: every paging alert has a runbook
Escalation: documented policy with timeout at each level
Handoff: structured, during business hours, with context transfer
Compensation: financial or PTO, acknowledged by management
Review: weekly alert review, monthly trend analysis, quarterly process improvement
Related Resources
Incident Management Guide — full cycle from detection to post-mortem
How to Reduce False Alarms — quorum confirmation and batch anomaly detection
PagerDuty Alternatives — tools for on-call scheduling
SLA vs SLO vs SLI — how reliability metrics relate to on-call
FAQ
What is an on-call rotation?
An on-call rotation is a schedule where team members take turns being the primary responder for production incidents. The on-call engineer carries a pager (phone/Slack/PagerDuty) and is responsible for acknowledging and triaging alerts during their shift — typically 1 week, with handoffs on a fixed day.
How long should an on-call shift last?
One week is the most common rotation. Shorter (2-3 days) reduces fatigue but increases handoff overhead. Longer (2 weeks) causes burnout. For small teams (3-4 people), weekly rotation means on-call every 3-4 weeks — sustainable if alert volume is reasonable.
What's the minimum team size for on-call?
4 people minimum for a sustainable rotation. With 3 people, each person is on-call every 3 weeks — borderline. With 2 people, it's every other week — unsustainable long-term. If your team is smaller than 4, consider shared on-call with another team or using a managed incident response service.
How do I reduce on-call alert volume?
Three approaches: (1) Fix the root cause — if the same alert fires weekly, fix the underlying issue, don't just acknowledge it. (2) Tune thresholds — if alerts fire for non-actionable conditions, raise the threshold. (3) Use quorum confirmation — AtomPing's quorum prevents false alarms from waking people up for network glitches.
Should on-call engineers be compensated?
Yes. On-call restricts personal time and causes stress. Common models: flat stipend per on-call week ($200-500), hourly rate for time spent on incidents, extra PTO days per on-call rotation, or a combination. Companies that don't compensate on-call struggle with retention and morale.
What tools do I need for on-call?
Minimum: monitoring (AtomPing for detection), alerting (Slack/Telegram/PagerDuty for notifications), runbooks (documented procedures for common alerts), communication (Slack channel for incident coordination). Advanced: on-call scheduling (PagerDuty, OpsGenie, Better Stack), status page (AtomPing), post-incident review template.