Without incident severity classification, all problems seem equally critical. Server down for 30 seconds? P1. CSS not loading? P1. Both require immediate response—and your team burns out in a month. Proper P1-P4 classification lets you allocate resources smartly: respond fast to real crises and handle routine issues efficiently.
This guide defines clear boundaries between levels, shows how to decide in gray areas, and explains the relationship between severity and response time. Goal: your team spends hours on what truly matters, not fighting false alarms.
Why Classification Matters
Imagine your company has no classification, only alerts. Something unusual happens and everyone gets an SMS; only afterwards do you figure out what it was. Soon:
- On-call engineer can't sleep at night (even minor issues wake them)
- Team stops trusting alerts (alert fatigue)
- Real critical incident at 3 AM goes unnoticed because calls are lost in the noise
- Burnout and staff turnover
Classification creates a contract: "P1 is when you call, P2 is when you message, P3 is check tomorrow". Clear rules reduce stress and increase effectiveness.
P1: Critical
Definition
Service completely unavailable for a significant portion of users, data loss occurring, or security threat present. Business loses money every minute.
P1 Examples
- Complete outage: all three regions down simultaneously
- Data loss: database corrupted, transactions lost
- Security breach: vulnerability discovered, attacker has access
- Payment processing down: payments not going through, revenue lost
- Authentication broken: no one can log in, app is useless
Response Time
First response: 5 minutes (from alert trigger)
Updates: every 15 minutes
Escalation: immediate (CEO, CTO, senior engineer)
Actions for P1
- Call on-call engineer by phone (not Slack)
- Summon incident commander and start video call
- Start communication channel (update status every 15 minutes)
- All leaders on Zoom observing and helping with decisions
- Root cause analysis (RCA) can wait; stabilization is the priority
P2: Major
Definition
Significant feature degradation or partial outage. Some users affected, but workaround exists or system partially works.
P2 Examples
- One region down: EU users can use US, but with latency
- Degradation: API works but responds in 5+ seconds instead of 200ms
- Partial feature down: data export broken, but rest works
- Database connection pool exhausted: new requests hang, old ones work
- Memory leak in a worker: processes crash every hour but auto-restart
Response Time
First response: 15 minutes
Updates: every 30 minutes
Escalation: on-call engineer + team lead
Actions for P2
- Message the on-call engineer (Slack plus SMS)
- Create incident in Slack and invite team
- No need to call CEO, but team lead must know
- Update status page (if public) every 30 minutes
- Simultaneously collect information for RCA
P3: Minor
Definition
Small issue with limited user impact. Simple workaround exists or affects very few people.
P3 Examples
- CSS broken on settings page: buttons invisible, but function works via API
- Email notifications delayed: arrive an hour late, but they arrive
- One user cannot log in: specific to one person (might be browser)
- Timezone conversion bug: users from one country get wrong timezone
- Documentation page 404: docs unavailable, but function works
Response Time
First response: 1 hour (business hours)
Updates: once per day
Escalation: none; an engineer looks at it during the day
Actions for P3
- Create issue in issue tracker
- Post to Slack (text, no @here)
- On-call engineer checks it during the day in the bug-fix time block
- No emergency response needed
- Can be resolved next sprint
P4: Low
Definition
Cosmetic issue, feature request, or very rare edge case. No impact on functionality or UX.
P4 Examples
- Typo in docs: "montioring" instead of "monitoring"
- Button color slightly off: gray shade instead of dark gray
- Exception message unclear: shows "Internal Error 500" instead of the actual reason
- Performance could be better: API responds in 100ms, ideally 50ms
- Feature request: users want Excel export (CSV already exists)
Response Time
First response: next sprint (not urgent)
Updates: whenever time allows
Escalation: none
Actions for P4
- Create issue in backlog
- Post in general channel (no urgency)
- Engineer will check when they have free time
- This doesn't block anyone
Response Time Table
| Severity | First Response | Updates | Escalation | Notification Method |
|---|---|---|---|---|
| P1 | 5 minutes | Every 15 min | CEO, CTO, IR lead | Phone call + SMS |
| P2 | 15 minutes | Every 30 min | Team lead, on-call | SMS + Slack |
| P3 | 1 hour | Daily summary | On-call engineer | Email + Slack |
| P4 | Next sprint | As needed | None | Backlog issue |
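The table above translates directly into configuration for an on-call bot or runbook tool. A minimal Python sketch; the values come from the table, but the `SeverityPolicy` structure is illustrative, not any real tool's API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SeverityPolicy:
    first_response_min: Optional[int]   # minutes; None means "next sprint"
    update_interval_min: Optional[int]
    escalation: Tuple[str, ...]
    channels: Tuple[str, ...]

# Values copied from the response time table above.
POLICIES = {
    "P1": SeverityPolicy(5, 15, ("CEO", "CTO", "IR lead"), ("phone", "sms")),
    "P2": SeverityPolicy(15, 30, ("team lead", "on-call"), ("sms", "slack")),
    "P3": SeverityPolicy(60, 24 * 60, ("on-call engineer",), ("email", "slack")),
    "P4": SeverityPolicy(None, None, (), ("backlog",)),
}

def first_response_deadline(severity: str) -> str:
    """Human-readable first-response SLA for a severity level."""
    policy = POLICIES[severity]
    if policy.first_response_min is None:
        return "next sprint"
    return f"{policy.first_response_min} min"
```

Keeping the table as data (rather than scattering `if severity == "P1"` checks through the code) means the SLAs live in one place and can be audited against the playbook.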
Gray Areas: How to Decide
Not all issues fit cleanly in one category. Here's how to think about edge cases.
Case: One API endpoint is slow
Questions: Which endpoint? How many users affected? Is there a workaround?
- If it's checkout API affecting 10% of payments → P1 (revenue loss)
- If it's analytics API affecting 50% of data → P2 (works slowly)
- If it's notifications API delayed 2 hours → P3 (arrives later)
Case: Error spike in logs
Questions: Is this a real feature failure? Do users see it?
- If it's payment processing exception → P1 (payments fail)
- If it's cache update warning → P3 (self-healing)
- If it's debug log accidentally on ERROR level → P4 (false alarm)
Rule of Thumb
Ask yourself: "Will users lose money, data, or be disappointed within an hour?"
- Yes, critical: P1
- Yes, but workaround exists: P2
- Maybe, unlikely: P3
- No, don't worry: P4
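The rule of thumb can be expressed as a tiny triage helper. A sketch assuming three yes/no answers from the responder (the question names are mine, not from any standard):

```python
def classify(losing_money_or_data: bool, workaround_exists: bool,
             users_notice: bool) -> str:
    """Apply the 'money, data, or disappointment within an hour' rule."""
    if losing_money_or_data and not workaround_exists:
        return "P1"   # yes, critical
    if losing_money_or_data:
        return "P2"   # yes, but a workaround exists
    if users_notice:
        return "P3"   # maybe, unlikely
    return "P4"       # no, don't worry
```

For example, checkout down with no fallback is `classify(True, False, True)` → P1, while broken data export with an API workaround is `classify(True, True, True)` → P2.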
Severity vs Priority: Key Difference
Severity = how bad the situation is (impact, losses). Priority = how fast you must respond (urgency).
| Scenario | Severity | Priority | Response |
|---|---|---|---|
| Database down | P1 | Critical | Page on-call immediately |
| Typo on landing page at 3 AM | P4 | Critical | Fix before morning (high priority, low severity) |
| Deprecated API removed from docs | P4 | Low | Fix in next sprint |
| One region slow, 1% users affected | P3 | High | Investigate soon if pattern continues |
Key takeaway: assess severity and priority independently, then combine them. A P4 can be urgent; a P2 can be deferred.
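One way to avoid conflating the two dimensions is to store them as separate fields and page on urgency alone. An illustrative sketch (the `Incident` shape is an assumption, not any tracker's schema):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    title: str
    severity: str   # "P1".."P4": business impact
    priority: str   # "critical" | "high" | "low": response urgency

def page_now(incident: Incident) -> bool:
    # Page on urgency, not impact: a P4 typo on the landing page
    # can still be "critical" priority before a launch.
    return incident.priority == "critical"
```

With two fields, the table's odd-looking rows (P4 + critical, P3 + high) become ordinary data instead of classification arguments.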
Escalation Rules
P1 escalation chain
1. Alert fired → on-call engineer notified (phone + SMS)
2. Within 5 min: engineer confirms, creates incident bridge
3. Within 10 min: team lead + incident commander join
4. Within 15 min: CTO + VP Ops informed (they may join)
5. Every 15 min: status update to all stakeholders
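The chain above is just a timetable, so an incident bot can check it mechanically. A sketch (milestone wording copied from the chain; the helper itself is illustrative, not part of any tool):

```python
from typing import List, Tuple

# Milestones from the P1 escalation chain: (deadline in minutes, milestone).
P1_CHAIN: List[Tuple[int, str]] = [
    (0, "on-call engineer notified (phone + SMS)"),
    (5, "engineer confirms, incident bridge created"),
    (10, "team lead + incident commander join"),
    (15, "CTO + VP Ops informed"),
]

def expected_by_now(minutes_since_alert: int) -> List[str]:
    """Return every milestone that should already be complete."""
    return [milestone for deadline, milestone in P1_CHAIN
            if minutes_since_alert >= deadline]
```

At minute 12, for instance, the engineer should have confirmed and the incident commander joined, but the CTO notification is not yet due.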
P2 escalation chain
1. Alert fired → on-call engineer notified (SMS)
2. Within 15 min: engineer investigates, creates incident
3. Within 30 min: team lead informed (Slack mention)
4. Update status every 30 min (internal Slack thread)
P3 + P4: No escalation
Engineer picks up during business hours or next shift. Document in issue tracker.
Common Mistakes
Mistake 1: Severity Inflation (Everything P1)
Team classifies routine issues as P1 to get fixes faster. Result: the on-call engineer sleeps poorly, alert fatigue sets in, real P1s get missed. Solution: audit classifications monthly ("Was this really a P1?") and train the team if not.
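The monthly audit can be as simple as counting how many declared P1s turned out not to be P1 once investigated. A hedged sketch (the input shape is an assumption):

```python
from typing import List, Tuple

def inflation_rate(incidents: List[Tuple[str, str]]) -> float:
    """incidents: (declared_severity, final_severity) pairs from one month.

    Returns the fraction of declared P1s that were later downgraded.
    """
    p1s = [(declared, final) for declared, final in incidents
           if declared == "P1"]
    if not p1s:
        return 0.0
    downgraded = sum(1 for _, final in p1s if final != "P1")
    return downgraded / len(p1s)
```

If six of ten declared P1s ended the month as P2 or P3, the rate is 0.6, a strong signal that the definitions need retraining.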
Mistake 2: No Clear Definitions
Each engineer interprets P2 vs P3 differently. One thinks "API slow" is P2, another says P3. Result: chaos. Solution: document clear criteria, show examples, update with experience. Make it part of on-call playbook.
Mistake 3: Severity Doesn't Change
Classified as P1, later found it's P3, but no downgrade. Team wastes hours. Solution: re-evaluate as info arrives. Tell stakeholders: "We believe this is P3, not P1, because..."
Mistake 4: Severity == Priority
Think high severity means high priority. But P4 bug for CEO may be more critical than P2 for one user. Solution: separate severity and priority. Classify both.
Severity and Monitoring: Practical Application
Monitoring alert policies should map directly onto your severity levels. AtomPing lets you configure alert policies so that different kinds of incidents trigger different channels.
Example AtomPing Alert Policy Configuration
```
Target: API Server (eu-fra1, us-east-1, ap-sin1)

Rule 1 - P1 (Critical):
  Trigger:  3+ regions DOWN in the same cycle
  Channels: PagerDuty + SMS + phone call

Rule 2 - P2 (Major):
  Trigger:  1 region DOWN or DEGRADED for 2+ cycles
  Channels: Slack #incidents + email

Rule 3 - P3 (Minor):
  Trigger:  1 region slow (TTFB > 2s) for 3+ cycles
  Channels: Email only
```

Using this configuration, you ensure P1 gets immediate attention, P2 is handled properly, and P3 doesn't spam channels.
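The three rules can also be prototyped outside any vendor UI, which is useful for testing thresholds against historical data. A minimal sketch, assuming each monitoring cycle yields a per-region `(status, ttfb_seconds)` pair; the thresholds mirror the rules above, but nothing here is real AtomPing code:

```python
def severity_for_cycle(history):
    """history: newest-last list of {region: (status, ttfb_s)} dicts.

    Assumes every cycle reports the same set of regions.
    Returns "P1"/"P2"/"P3" per the rules above, or None if no rule fires.
    """
    latest = history[-1]
    down = [r for r, (status, _) in latest.items() if status == "DOWN"]
    if len(down) >= 3:
        return "P1"  # Rule 1: 3+ regions DOWN in the same cycle
    if len(history) >= 2:
        bad = [r for r in latest
               if all(h[r][0] in ("DOWN", "DEGRADED") for h in history[-2:])]
        if bad:
            return "P2"  # Rule 2: a region DOWN/DEGRADED for 2+ cycles
    if len(history) >= 3:
        slow = [r for r in latest
                if all(h[r][1] > 2.0 for h in history[-3:])]
        if slow:
            return "P3"  # Rule 3: TTFB > 2s for 3+ cycles
    return None
```

Replaying a few weeks of real probe data through a function like this shows whether the cycle counts are tight enough to catch outages without paging on blips.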
Checklist: Is Your Team Ready?
- ☐ You have written P1-P4 definitions with examples
- ☐ On-call playbook has response times for each level
- ☐ All engineers trained and agree with definitions
- ☐ Alert policies in monitoring match severity levels
- ☐ You regularly audit incidents (monthly) to catch inflation
- ☐ Process exists to reclassify severity as info arrives
- ☐ Severity and priority defined separately
Summary
Severity classification P1-P4 isn't bureaucracy; it's a tool for sane incident response. Clear definitions reduce stress, prevent burnout, and ensure critical incidents get proper attention.
Remember: your team is overloaded because everything is urgent. Proper severity classification frees capacity for what truly matters.
Related Resources
- → Incident Management Guide — complete incident management from alert to RCA
- → On-Call Rotation Best Practices — how to organize on-call rotation to prevent burnout
- → Incident Communication Templates — ready-to-use message templates for each severity level
- → Post-Incident Review Guide — how to conduct blameless RCA and prevent recurrence
FAQ
What is the difference between incident severity and priority?
Severity describes the business impact of an incident (how bad it is). Priority describes the urgency of response (how fast we must react). A P2 incident might have high priority (respond in 15 min) but actually be cosmetic in nature. A P4 issue might be high priority if it affects a key stakeholder but has low severity. Never confuse the two: define both separately in your incident response process.
Can an incident change severity during its lifecycle?
Yes, frequently. An incident might start as P3 (single region down) but escalate to P1 if it spreads to multiple regions. Conversely, a P1 incident might de-escalate to P2 once the main issue is mitigated and only cleanup remains. Always re-evaluate severity as new information arrives. Update stakeholders when severity changes: 'We initially classified this as P1, but it's now P2 because the primary database recovered.'
How do you decide if an incident is P1 or P2 when multiple regions are affected?
Look at user impact, not region count. P1 = complete outage or critical data loss, regardless of region count. P2 = significant degradation affecting a region subset. If 2 regions are down but users can route to a 3rd region with acceptable latency, it might be P2 (not P1). If 1 region is down but it's your primary revenue source, it's P1. Severity = impact, not topology.
Should we have different response times for different teams (frontend vs backend vs ops)?
Yes. A P1 incident requires all teams to respond within 5 minutes total (including alerting/escalation overhead). This doesn't mean every team member must be at their keyboard in 5 min—it means the on-call rotation must have someone ready. Backend on-call might respond in 2 min, frontend in 3 min, ops in 1 min. Each team's SLA is part of the overall incident response SLA.
What does 'severity inflation' mean and why is it a problem?
Severity inflation happens when teams classify routine issues as P1 or P2 to get faster response. Over time, P1 means nothing—every issue is P1, so nothing gets proper investigation. This burns out on-call teams, causes alert fatigue, and delays actual critical incidents. Prevent it by enforcing definitions, auditing classifications, and training teams on the cost of false alarms.
How does AtomPing's alert system connect to incident severity?
AtomPing detects outages and sends alerts based on your configured thresholds. Use alert policy rules to trigger escalation at different severity levels: configure a critical alert (P1) to trigger when 3 regions fail, a major alert (P2) when 1 region fails, minor alert (P3) for degradation. Link alert policies to specific notification channels: P1 → PagerDuty + SMS + call, P2 → Slack + email, P3 → email only. This maps monitoring directly to severity response.