Without incident severity classification, all problems seem equally critical. Server down for 30 seconds? P1. CSS not loading? P1. Both require immediate response—and your team burns out in a month. Proper P1-P4 classification lets you allocate resources smartly: respond fast to real crises and handle routine issues efficiently.
This guide defines clear boundaries between levels, shows how to decide in gray areas, and explains the relationship between severity and response time. Goal: your team spends hours on what truly matters, not fighting false alarms.
Why Classification Matters
Imagine your company has no classification, only alerts. Something unusual happens and everyone gets an SMS; only afterwards do you figure out what it was. Soon:
- On-call engineer can't sleep at night (even minor issues wake them)
- Team stops trusting alerts (alert fatigue)
- Real critical incident at 3 AM goes unnoticed because calls are lost in the noise
- Burnout and staff turnover
Classification creates a contract: "P1 is when you call, P2 is when you message, P3 is check tomorrow". Clear rules reduce stress and increase effectiveness.
P1: Critical
Definition
Service completely unavailable for a significant portion of users, data loss occurring, or security threat present. Business loses money every minute.
P1 Examples
- Complete outage: all three regions down simultaneously
- Data loss: database corrupted, transactions lost
- Security breach: vulnerability discovered, attacker has access
- Payment processing down: payments not going through, revenue lost
- Authentication broken: no one can log in, app is useless
Response Time
First response: 5 minutes (from alert trigger)
Updates: every 15 minutes
Escalation: immediate (CEO, CTO, senior engineer)
Actions for P1
- Call on-call engineer by phone (not Slack)
- Summon incident commander and start video call
- Start communication channel (update status every 15 minutes)
- All leaders on Zoom observing and helping with decisions
- Root cause analysis (RCA) can wait; stabilization is the priority
P2: Major
Definition
Significant feature degradation or partial outage. Some users affected, but workaround exists or system partially works.
P2 Examples
- One region down: EU users can use US, but with latency
- Degradation: API works but responds in 5+ seconds instead of 200ms
- Partial feature down: data export broken, but rest works
- Database connection pool exhausted: new requests hang, old ones work
- Memory leak in a worker: processes crash every hour but auto-restart
Response Time
First response: 15 minutes
Updates: every 30 minutes
Escalation: on-call engineer + team lead
Actions for P2
- Message the on-call engineer (Slack plus SMS)
- Create incident in Slack and invite team
- No need to call CEO, but team lead must know
- Update status page (if public) every 30 minutes
- Simultaneously collect information for RCA
P3: Minor
Definition
Small issue with limited user impact. Simple workaround exists or affects very few people.
P3 Examples
- CSS broken on settings page: buttons invisible, but function works via API
- Email notifications delayed: arrive an hour late, but they arrive
- One user cannot log in: specific to one person (might be browser)
- Timezone conversion bug: users from one country get wrong timezone
- Documentation page 404: docs unavailable, but function works
Response Time
First response: 1 hour (business hours)
Updates: once per day
Escalation: none; an engineer looks at it during the day
Actions for P3
- Create issue in issue tracker
- Post to Slack (text, no @here)
- On-call engineer checks it during the day in the bug-fix time block
- No emergency response needed
- Can be resolved next sprint
P4: Low
Definition
Cosmetic issue, feature request, or very rare edge case. No impact on functionality or UX.
P4 Examples
- Typo in docs: "montioring" instead of "monitoring"
- Button color slightly off: gray shade instead of dark gray
- Exception message unclear: shows "Internal Error 500" instead of the actual reason
- Performance could be better: API responds in 100ms, ideally 50ms
- Feature request: users want Excel export (CSV already exists)
Response Time
First response: next sprint (not urgent)
Updates: whenever time allows
Escalation: none
Actions for P4
- Create issue in backlog
- Post in general channel (no urgency)
- Engineer will check when they have free time
- This doesn't block anyone
Response Time Table
| Severity | First Response | Updates | Escalation | Notification Method |
|---|---|---|---|---|
| P1 | 5 minutes | Every 15 min | CEO, CTO, IR lead | Phone call + SMS |
| P2 | 15 minutes | Every 30 min | Team lead, on-call | SMS + Slack |
| P3 | 1 hour | Daily summary | On-call engineer | Email + Slack |
| P4 | Next sprint | As needed | None | Backlog issue |
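The table above translates directly into configuration for an on-call bot or runbook tool. A minimal Python sketch; the values come from the table, but the `SeverityPolicy` structure is illustrative, not any real tool's API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SeverityPolicy:
    first_response_min: Optional[int]   # minutes; None means "next sprint"
    update_interval_min: Optional[int]
    escalation: Tuple[str, ...]
    channels: Tuple[str, ...]

# Values copied from the response time table above.
POLICIES = {
    "P1": SeverityPolicy(5, 15, ("CEO", "CTO", "IR lead"), ("phone", "sms")),
    "P2": SeverityPolicy(15, 30, ("team lead", "on-call"), ("sms", "slack")),
    "P3": SeverityPolicy(60, 24 * 60, ("on-call engineer",), ("email", "slack")),
    "P4": SeverityPolicy(None, None, (), ("backlog",)),
}

def first_response_deadline(severity: str) -> str:
    """Human-readable first-response SLA for a severity level."""
    policy = POLICIES[severity]
    if policy.first_response_min is None:
        return "next sprint"
    return f"{policy.first_response_min} min"
```

Keeping the table as data (rather than scattering `if severity == "P1"` checks through the code) means the SLAs live in one place and can be audited against the playbook.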
Gray Areas: How to Decide
Not all issues fit cleanly in one category. Here's how to think about edge cases.
Case: One API endpoint is slow
Questions: Which endpoint? How many users affected? Is there a workaround?
- If it's checkout API affecting 10% of payments → P1 (revenue loss)
- If it's analytics API affecting 50% of data → P2 (works slowly)
- If it's notifications API delayed 2 hours → P3 (arrives later)
Case: Error spike in logs
Questions: Is this a real feature failure? Do users see it?
- If it's payment processing exception → P1 (payments fail)
- If it's cache update warning → P3 (self-healing)
- If it's debug log accidentally on ERROR level → P4 (false alarm)
Rule of Thumb
Ask yourself: "Will users lose money, data, or be disappointed within an hour?"
- Yes, critical: P1
- Yes, but workaround exists: P2
- Maybe, unlikely: P3
- No, don't worry: P4
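The rule of thumb can be expressed as a tiny triage helper. A sketch assuming three yes/no answers from the responder (the question names are mine, not from any standard):

```python
def classify(losing_money_or_data: bool, workaround_exists: bool,
             users_notice: bool) -> str:
    """Apply the 'money, data, or disappointment within an hour' rule."""
    if losing_money_or_data and not workaround_exists:
        return "P1"   # yes, critical
    if losing_money_or_data:
        return "P2"   # yes, but a workaround exists
    if users_notice:
        return "P3"   # maybe, unlikely
    return "P4"       # no, don't worry
```

For example, checkout down with no fallback is `classify(True, False, True)` → P1, while broken data export with an API workaround is `classify(True, True, True)` → P2.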
Severity vs Priority: Key Difference
Severity = how bad the situation is (impact, losses). Priority = how fast you must respond (urgency).
| Scenario | Severity | Priority | Response |
|---|---|---|---|
| Database down | P1 | Critical | Page on-call immediately |
| Typo on landing page at 3 AM | P4 | Critical | Fix before morning (high priority, low severity) |
| Deprecated API removed from docs | P4 | Low | Fix in next sprint |
| One region slow, 1% users affected | P3 | High | Investigate soon if pattern continues |
Key takeaway: assess severity and priority independently, then combine them. A P4 can be urgent; a P2 can be deferred.
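One way to avoid conflating the two dimensions is to store them as separate fields and page on urgency alone. An illustrative sketch (the `Incident` shape is an assumption, not any tracker's schema):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    title: str
    severity: str   # "P1".."P4": business impact
    priority: str   # "critical" | "high" | "low": response urgency

def page_now(incident: Incident) -> bool:
    # Page on urgency, not impact: a P4 typo on the landing page
    # can still be "critical" priority before a launch.
    return incident.priority == "critical"
```

With two fields, the table's odd-looking rows (P4 + critical, P3 + high) become ordinary data instead of classification arguments.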
Escalation Rules
P1 escalation chain
1. Alert fired → on-call engineer notified (phone + SMS)
2. Within 5 min: engineer confirms, creates incident bridge
3. Within 10 min: team lead + incident commander join
4. Within 15 min: CTO + VP Ops informed (they may join)
5. Every 15 min: status update to all stakeholders
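The chain above is just a timetable, so an incident bot can check it mechanically. A sketch (milestone wording copied from the chain; the helper itself is illustrative, not part of any tool):

```python
from typing import List, Tuple

# Milestones from the P1 escalation chain: (deadline in minutes, milestone).
P1_CHAIN: List[Tuple[int, str]] = [
    (0, "on-call engineer notified (phone + SMS)"),
    (5, "engineer confirms, incident bridge created"),
    (10, "team lead + incident commander join"),
    (15, "CTO + VP Ops informed"),
]

def expected_by_now(minutes_since_alert: int) -> List[str]:
    """Return every milestone that should already be complete."""
    return [milestone for deadline, milestone in P1_CHAIN
            if minutes_since_alert >= deadline]
```

At minute 12, for instance, the engineer should have confirmed and the incident commander joined, but the CTO notification is not yet due.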
P2 escalation chain
1. Alert fired → on-call engineer notified (SMS)
2. Within 15 min: engineer investigates, creates incident
3. Within 30 min: team lead informed (Slack mention)
4. Update status every 30 min (internal Slack thread)
P3 + P4: No escalation
Engineer picks up during business hours or next shift. Document in issue tracker.
Common Mistakes
Mistake 1: Severity Inflation (Everything P1)
Team classifies routine issues as P1 to get fixes faster. Result: the on-call engineer sleeps poorly, alert fatigue sets in, real P1s get missed. Solution: audit classifications monthly ("Was this really a P1?") and train the team if not.
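The monthly audit can be as simple as counting how many declared P1s turned out not to be P1 once investigated. A hedged sketch (the input shape is an assumption):

```python
from typing import List, Tuple

def inflation_rate(incidents: List[Tuple[str, str]]) -> float:
    """incidents: (declared_severity, final_severity) pairs from one month.

    Returns the fraction of declared P1s that were later downgraded.
    """
    p1s = [(declared, final) for declared, final in incidents
           if declared == "P1"]
    if not p1s:
        return 0.0
    downgraded = sum(1 for _, final in p1s if final != "P1")
    return downgraded / len(p1s)
```

If six of ten declared P1s ended the month as P2 or P3, the rate is 0.6, a strong signal that the definitions need retraining.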
Mistake 2: No Clear Definitions
Each engineer interprets P2 vs P3 differently. One thinks "API slow" is P2, another says P3. Result: chaos. Solution: document clear criteria, show examples, update with experience. Make it part of on-call playbook.
Mistake 3: Severity Doesn't Change
Classified as P1, later found it's P3, but no downgrade. Team wastes hours. Solution: re-evaluate as info arrives. Tell stakeholders: "We believe this is P3, not P1, because..."
Mistake 4: Severity == Priority
Think high severity means high priority. But P4 bug for CEO may be more critical than P2 for one user. Solution: separate severity and priority. Classify both.
Severity and Monitoring: Practical Application
Monitoring alert policies should map directly onto your severity levels. AtomPing lets you configure alert policies so that different kinds of incidents trigger different channels.
Example AtomPing Alert Policy Configuration
```
Target: API Server (eu-fra1, us-east-1, ap-sin1)

Rule 1 - P1 (Critical):
  Trigger:  3+ regions DOWN in the same cycle
  Channels: PagerDuty + SMS + phone call

Rule 2 - P2 (Major):
  Trigger:  1 region DOWN or DEGRADED for 2+ cycles
  Channels: Slack #incidents + email

Rule 3 - P3 (Minor):
  Trigger:  1 region slow (TTFB > 2s) for 3+ cycles
  Channels: Email only
```

Using this configuration, you ensure P1 gets immediate attention, P2 is handled properly, and P3 doesn't spam channels.
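The three rules can also be prototyped outside any vendor UI, which is useful for testing thresholds against historical data. A minimal sketch, assuming each monitoring cycle yields a per-region `(status, ttfb_seconds)` pair; the thresholds mirror the rules above, but nothing here is real AtomPing code:

```python
def severity_for_cycle(history):
    """history: newest-last list of {region: (status, ttfb_s)} dicts.

    Assumes every cycle reports the same set of regions.
    Returns "P1"/"P2"/"P3" per the rules above, or None if no rule fires.
    """
    latest = history[-1]
    down = [r for r, (status, _) in latest.items() if status == "DOWN"]
    if len(down) >= 3:
        return "P1"  # Rule 1: 3+ regions DOWN in the same cycle
    if len(history) >= 2:
        bad = [r for r in latest
               if all(h[r][0] in ("DOWN", "DEGRADED") for h in history[-2:])]
        if bad:
            return "P2"  # Rule 2: a region DOWN/DEGRADED for 2+ cycles
    if len(history) >= 3:
        slow = [r for r in latest
                if all(h[r][1] > 2.0 for h in history[-3:])]
        if slow:
            return "P3"  # Rule 3: TTFB > 2s for 3+ cycles
    return None
```

Replaying a few weeks of real probe data through a function like this shows whether the cycle counts are tight enough to catch outages without paging on blips.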
Checklist: Is Your Team Ready?
- ☐ You have written P1-P4 definitions with examples
- ☐ On-call playbook has response times for each level
- ☐ All engineers trained and agree with definitions
- ☐ Alert policies in monitoring match severity levels
- ☐ You regularly audit incidents (monthly) to catch inflation
- ☐ Process exists to reclassify severity as info arrives
- ☐ Severity and priority defined separately
Summary
Severity classification P1-P4 isn't bureaucracy; it's a tool for sane incident response. Clear definitions reduce stress, prevent burnout, and ensure critical incidents get proper attention.
Remember: your team is overloaded because everything is urgent. Proper severity classification frees capacity for what truly matters.
Related Resources
- → Incident Management Guide — complete incident management from alert to RCA
- → On-Call Rotation Best Practices — how to organize on-call rotation to prevent burnout
- → Incident Communication Templates — ready-to-use message templates for each severity level
- → Post-Incident Review Guide — how to conduct blameless RCA and prevent recurrence
FAQ
What is the difference between incident severity and priority?
Severity describes the business impact of an incident (how bad it is). Priority describes the urgency of response (how fast we must react). A P2 incident might have high priority (respond in 15 min) but actually be cosmetic in nature. A P4 issue might be high priority if it affects a key stakeholder but has low severity. Never confuse the two: define both separately in your incident response process.
Can an incident change severity during its lifecycle?
Yes, frequently. An incident might start as P3 (single region down) but escalate to P1 if it spreads to multiple regions. Conversely, a P1 incident might de-escalate to P2 once the main issue is mitigated and only cleanup remains. Always re-evaluate severity as new information arrives. Update stakeholders when severity changes: 'We initially classified this as P1, but it's now P2 because the primary database recovered.'
How do you decide if an incident is P1 or P2 when multiple regions are affected?
Look at user impact, not region count. P1 = complete outage or critical data loss, regardless of region count. P2 = significant degradation affecting a region subset. If 2 regions are down but users can route to a 3rd region with acceptable latency, it might be P2 (not P1). If 1 region is down but it's your primary revenue source, it's P1. Severity = impact, not topology.
Should we have different response times for different teams (frontend vs backend vs ops)?
Yes. A P1 incident requires all teams to respond within 5 minutes total (including alerting/escalation overhead). This doesn't mean every team member must be at their keyboard in 5 min—it means the on-call rotation must have someone ready. Backend on-call might respond in 2 min, frontend in 3 min, ops in 1 min. Each team's SLA is part of the overall incident response SLA.
What does 'severity inflation' mean and why is it a problem?
Severity inflation happens when teams classify routine issues as P1 or P2 to get faster response. Over time, P1 means nothing—every issue is P1, so nothing gets proper investigation. This burns out on-call teams, causes alert fatigue, and delays actual critical incidents. Prevent it by enforcing definitions, auditing classifications, and training teams on the cost of false alarms.
How does AtomPing's alert system connect to incident severity?
AtomPing detects outages and sends alerts based on your configured thresholds. Use alert policy rules to trigger escalation at different severity levels: configure a critical alert (P1) to trigger when 3 regions fail, a major alert (P2) when 1 region fails, minor alert (P3) for degradation. Link alert policies to specific notification channels: P1 → PagerDuty + SMS + call, P2 → Slack + email, P3 → email only. This maps monitoring directly to severity response.