
Incident Severity Levels: P1-P4 Classification Guide

How to classify incidents by severity. P1-P4 definitions, escalation rules, response time expectations, and real-world examples for each level.

2026-03-26 · 10 min · Operations Guide

Without incident severity classification, all problems seem equally critical. Server down for 30 seconds? P1. CSS not loading? P1. Both require immediate response—and your team burns out in a month. Proper P1-P4 classification lets you allocate resources smartly: respond fast to real crises and handle routine issues efficiently.

This guide defines clear boundaries between levels, shows how to decide in gray areas, and explains the relationship between severity and response time. Goal: your team spends hours on what truly matters, not fighting false alarms.

Why Classification Matters

Imagine your company has no classification, only alerts. Whenever something unusual happens, everyone gets an SMS and then scrambles to figure out what it means. Soon:

  • On-call engineer can't sleep at night (even minor issues wake them)
  • Team stops trusting alerts (alert fatigue)
  • Real critical incident at 3 AM goes unnoticed because calls are lost in the noise
  • Burnout and staff turnover

Classification creates a contract: "P1 is when you call, P2 is when you message, P3 is check tomorrow". Clear rules reduce stress and increase effectiveness.

P1: Critical

Definition

Service completely unavailable for a significant portion of users, data loss occurring, or security threat present. Business loses money every minute.

P1 Examples

  • Complete outage: all three regions down simultaneously
  • Data loss: database corrupted, transactions lost
  • Security breach: vulnerability discovered, attacker has access
  • Payment processing down: payments not going through, revenue lost
  • Authentication broken: no one can log in, app is useless

Response Time

First response: 5 minutes (from alert trigger)

Updates: every 15 minutes

Escalation: immediately notify CEO, CTO, and senior engineer

Actions for P1

  • Call on-call engineer by phone (not Slack)
  • Summon incident commander and start video call
  • Start communication channel (update status every 15 minutes)
  • All leaders on Zoom observing and helping with decisions
  • Defer root cause analysis (RCA) for now; stabilization is the priority

P2: Major

Definition

Significant feature degradation or partial outage. Some users affected, but workaround exists or system partially works.

P2 Examples

  • One region down: EU users can use US, but with latency
  • Degradation: API works but responds in 5+ seconds instead of 200ms
  • Partial feature down: data export broken, but rest works
  • Database connection pool exhausted: new requests hang, old ones work
  • Memory leak in worker: processes crash every hour but auto-restart

Response Time

First response: 15 minutes

Updates: every 30 minutes

Escalation: on-call engineer + team lead

Actions for P2

  • Message on-call engineer (SMS included)
  • Create incident in Slack and invite team
  • No need to call CEO, but team lead must know
  • Update status page (if public) every 30 minutes
  • Simultaneously collect information for RCA

P3: Minor

Definition

Small issue with limited user impact. Simple workaround exists or affects very few people.

P3 Examples

  • CSS broken on settings page: buttons invisible, but function works via API
  • Email notifications delayed: arrive an hour late, but they arrive
  • One user cannot log in: specific to one person (might be browser)
  • Timezone conversion bug: users from one country get wrong timezone
  • Documentation page 404: docs unavailable, but function works

Response Time

First response: 1 hour (business hours)

Updates: once per day

Escalation: none; on-call engineer handles it during the day

Actions for P3

  • Create issue in issue tracker
  • Post to Slack (text, no @here)
  • On-call engineer checks it during the day in the bug-fix time block
  • No emergency response needed
  • Can be resolved next sprint

P4: Low

Definition

Cosmetic issue, feature request, or very rare edge case. Negligible impact on functionality or user experience.

P4 Examples

  • Typo in docs: "montioring" instead of "monitoring"
  • Button color slightly off: gray shade instead of dark gray
  • Exception message unclear: shows "Internal Error 500" instead of reason
  • Performance could be better: API responds in 100ms, ideally 50ms
  • Feature request: users want Excel export (CSV already exists)

Response Time

First response: next sprint (not time-sensitive)

Updates: whenever time allows

Escalation: none

Actions for P4

  • Create issue in backlog
  • Post in general channel (no urgency)
  • Engineer will check when they have free time
  • This doesn't block anyone

Response Time Table

Severity | First Response | Updates       | Escalation         | Notification Method
P1       | 5 minutes      | Every 15 min  | CEO, CTO, IR lead  | Phone call + SMS
P2       | 15 minutes     | Every 30 min  | Team lead, on-call | SMS + Slack
P3       | 1 hour         | Daily summary | On-call engineer   | Email + Slack
P4       | Next sprint    | As needed     | None               | Backlog issue
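For tooling that checks whether a first response is overdue, the matrix above can be captured as a small lookup table. This is a minimal sketch: the `RESPONSE_SLA` structure and `sla_breached` helper are illustrative names, not part of any monitoring product's API.

```python
from datetime import timedelta

# Hypothetical SLA table mirroring the response-time matrix above.
RESPONSE_SLA = {
    "P1": {"first_response": timedelta(minutes=5),  "update_every": timedelta(minutes=15)},
    "P2": {"first_response": timedelta(minutes=15), "update_every": timedelta(minutes=30)},
    "P3": {"first_response": timedelta(hours=1),    "update_every": timedelta(days=1)},
    "P4": {"first_response": None,                  "update_every": None},  # next sprint
}

def sla_breached(severity: str, elapsed: timedelta) -> bool:
    """Return True if the first response is overdue for this severity."""
    target = RESPONSE_SLA[severity]["first_response"]
    return target is not None and elapsed > target
```

A chatops bot could call `sla_breached` on a timer and nudge the incident channel when a P1 has gone unanswered past the 5-minute mark.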

Gray Areas: How to Decide

Not all issues fit cleanly in one category. Here's how to think about edge cases.

Case: One API endpoint is slow

Questions: Which endpoint? How many users affected? Is there a workaround?

  • If it's checkout API affecting 10% of payments → P1 (revenue loss)
  • If it's analytics API affecting 50% of data → P2 (works slowly)
  • If it's notifications API delayed 2 hours → P3 (arrives later)

Case: Error spike in logs

Questions: Is this a real feature failure? Do users see it?

  • If it's payment processing exception → P1 (payments fail)
  • If it's cache update warning → P3 (self-healing)
  • If it's debug log accidentally on ERROR level → P4 (false alarm)

Rule of Thumb

Ask yourself: "Will users lose money, data, or be disappointed within an hour?"

  • Yes, critical: P1
  • Yes, but workaround exists: P2
  • Maybe, unlikely: P3
  • No, don't worry: P4
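The rule of thumb above can be sketched as a tiny decision function, for example if a triage script wants a mechanical first pass. The function name and inputs are hypothetical; a human should still confirm the result.

```python
def classify(impact: str, workaround_exists: bool = False) -> str:
    """Map the rule-of-thumb answer to a severity level (sketch).

    impact: answer to "Will users lose money, data, or be disappointed
    within an hour?" -- one of "yes", "maybe", "no".
    """
    if impact == "yes":
        # Critical unless a workaround softens the blow.
        return "P2" if workaround_exists else "P1"
    if impact == "maybe":
        return "P3"
    return "P4"
```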

Severity vs Priority: Key Difference

Severity = how bad the situation is (business impact, losses). Priority = how fast you must respond (urgency).

Scenario                           | Severity | Priority | Response
Database down                      | P1       | Critical | Page on-call immediately
Typo on landing page at 3 AM       | P4       | Critical | Fix before morning (high priority, low severity)
Deprecated API removed from docs   | P4       | Low      | Fix in next sprint
One region slow, 1% users affected | P3       | High     | Investigate soon if pattern continues

Key takeaway: assess severity and priority independently, then combine them. A P4 can be urgent, and a P2 can be deferred.

Escalation Rules

P1 escalation chain

1. Alert fired → on-call engineer notified (phone + SMS)

2. Within 5 min: engineer confirms, creates incident bridge

3. Within 10 min: team lead + incident commander join

4. Within 15 min: CTO + VP Ops informed (they may join)

5. Every 15 min: status update to all stakeholders

P2 escalation chain

1. Alert fired → on-call engineer notified (SMS)

2. Within 15 min: engineer investigates, creates incident

3. Within 30 min: team lead informed (Slack mention)

4. Update status every 30 min (internal Slack thread)

P3 + P4: No escalation

Engineer picks up during business hours or next shift. Document in issue tracker.
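The two chains above can also be encoded as data if you want a script that flags which escalation steps should already have happened at a given point in an incident. All role names and timings here simply mirror the lists above; the structure itself is an illustrative sketch, not tied to any tool.

```python
from datetime import timedelta

# Escalation chains from the P1/P2 steps above, as (deadline, action) pairs.
ESCALATION = {
    "P1": [
        (timedelta(0),          "notify on-call engineer (phone + SMS)"),
        (timedelta(minutes=5),  "engineer confirms, opens incident bridge"),
        (timedelta(minutes=10), "team lead + incident commander join"),
        (timedelta(minutes=15), "CTO + VP Ops informed"),
    ],
    "P2": [
        (timedelta(0),          "notify on-call engineer (SMS)"),
        (timedelta(minutes=15), "engineer investigates, creates incident"),
        (timedelta(minutes=30), "team lead informed (Slack mention)"),
    ],
    # P3/P4: no escalation chain by design.
}

def overdue_steps(severity: str, elapsed: timedelta) -> list[str]:
    """Steps whose deadline has already passed at `elapsed` into the incident."""
    return [step for deadline, step in ESCALATION.get(severity, []) if elapsed >= deadline]
```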

Common Mistakes

Mistake 1: Severity Inflation (Everything P1)

Team classifies routine issues as P1 to fix faster. Result: on-call engineer sleeps poorly, alert fatigue, real P1s missed. Solution: audit classifications. Monthly review: "Was this really P1?" Train team if not.

Mistake 2: No Clear Definitions

Each engineer interprets P2 vs P3 differently. One thinks "API slow" is P2, another says P3. Result: chaos. Solution: document clear criteria, show examples, update with experience. Make it part of on-call playbook.

Mistake 3: Severity Doesn't Change

Classified as P1, later found it's P3, but no downgrade. Team wastes hours. Solution: re-evaluate as info arrives. Tell stakeholders: "We believe this is P3, not P1, because..."

Mistake 4: Severity == Priority

Assuming high severity always means high priority. But a P4 bug affecting the CEO may be more urgent than a P2 affecting one user. Solution: separate severity and priority; classify both.

Severity and Monitoring: Practical Application

Proper classification in monitoring directly connects to severity levels. AtomPing lets you configure alert policies so different incidents trigger different channels.

Example AtomPing Alert Policy Configuration

Target: API Server (eu-fra1, us-east-1, ap-sin1)

Rule 1 - P1 (Critical):
  Trigger: 3+ regions DOWN in same cycle
  Channels: PagerDuty + SMS + call
  
Rule 2 - P2 (Major):
  Trigger: 1 region DOWN or DEGRADED for 2+ cycles
  Channels: Slack #incidents + email
  
Rule 3 - P3 (Minor):
  Trigger: 1 region slow (TTFB >2s) for 3+ cycles
  Channels: Email only

Using this configuration, you ensure P1 gets immediate attention, P2 is handled properly, and P3 doesn't spam channels.

Checklist: Is Your Team Ready?

  • You have written P1-P4 definitions with examples
  • On-call playbook has response times for each level
  • All engineers trained and agree with definitions
  • Alert policies in monitoring match severity levels
  • You regularly audit incidents (monthly) to catch inflation
  • Process exists to reclassify severity as info arrives
  • Severity and priority defined separately

Summary

Severity classification P1-P4 isn't bureaucracy, it's a tool for sane incident response. Clear definitions reduce stress, prevent burnout, and ensure critical incidents get proper attention.

Remember: your team is overloaded because everything is urgent. Proper severity classification frees capacity for what truly matters.


FAQ

What is the difference between incident severity and priority?

Severity describes the business impact of an incident (how bad it is). Priority describes the urgency of response (how fast we must react). A P2 incident might be safely deferred if a workaround holds, while a P4 issue might be high priority if it affects a key stakeholder despite its low severity. Never confuse the two: define both separately in your incident response process.

Can an incident change severity during its lifecycle?

Yes, frequently. An incident might start as P3 (single region down) but escalate to P1 if it spreads to multiple regions. Conversely, a P1 incident might de-escalate to P2 once the main issue is mitigated and only cleanup remains. Always re-evaluate severity as new information arrives. Update stakeholders when severity changes: 'We initially classified this as P1, but it's now P2 because the primary database recovered.'

How do you decide if an incident is P1 or P2 when multiple regions are affected?

Look at user impact, not region count. P1 = complete outage or critical data loss, regardless of region count. P2 = significant degradation affecting a region subset. If 2 regions are down but users can route to a 3rd region with acceptable latency, it might be P2 (not P1). If 1 region is down but it's your primary revenue source, it's P1. Severity = impact, not topology.

Should we have different response times for different teams (frontend vs backend vs ops)?

Yes. A P1 incident requires all teams to respond within 5 minutes total (including alerting/escalation overhead). This doesn't mean every team member must be at their keyboard in 5 min—it means the on-call rotation must have someone ready. Backend on-call might respond in 2 min, frontend in 3 min, ops in 1 min. Each team's SLA is part of the overall incident response SLA.

What does 'severity inflation' mean and why is it a problem?

Severity inflation happens when teams classify routine issues as P1 or P2 to get faster response. Over time, P1 means nothing—every issue is P1, so nothing gets proper investigation. This burns out on-call teams, causes alert fatigue, and delays actual critical incidents. Prevent it by enforcing definitions, auditing classifications, and training teams on the cost of false alarms.

How does AtomPing's alert system connect to incident severity?

AtomPing detects outages and sends alerts based on your configured thresholds. Use alert policy rules to trigger escalation at different severity levels: configure a critical alert (P1) to trigger when 3 regions fail, a major alert (P2) when 1 region fails, minor alert (P3) for degradation. Link alert policies to specific notification channels: P1 → PagerDuty + SMS + call, P2 → Slack + email, P3 → email only. This maps monitoring directly to severity response.
