The Incident Is Over. Now What?
The service is restored, users are happy, the status page is green. Your team exhales and gets back to normal work. Three weeks later, the same service fails for the same reason.
This isn't bad luck. It's the predictable result of skipping a post-incident review. Incidents are signals about weaknesses in your system. If you don't analyze the signal, the system will report the problem again—louder next time.
A post-incident review (PIR), also called a post-mortem, is a structured process for analyzing what happened. The goal isn't to assign blame, but to understand the systemic conditions that led to the failure and prevent it from happening again.
Blameless Culture: What It Means in Practice
"Blameless" isn't a buzzword, and it isn't about being nice. It's an engineering approach that produces better results. When people fear punishment, they hide information. When information is hidden, root cause analysis is incomplete. When analysis is incomplete, action items don't fix the real problem.
Instead of: "Who deployed this code without tests?"
We ask: "Why did the CI/CD pipeline allow deployment without running tests? What guardrails were missing?"
Instead of: "Why didn't you notice the alert for 40 minutes?"
We ask: "Why did the alert take 40 minutes to arrive? How is notification routing configured? Is there an escalation policy?"
Blameless doesn't mean "no accountability." Accountability shifts from individuals to systems. The engineer who deployed the bug isn't at fault. The system is at fault—for having no canary deployments, no automatic rollbacks, and no smoke tests after deployment.
When to Run a PIR
Not every incident needs a formal PIR. If you review every P4, your team will burn out and the process will lose its value. Here are the criteria:
| Incident Type | Run PIR? |
|---|---|
| P1 (Critical)—full outage | Always |
| P2 (Major)—significant degradation | Always |
| P3 with customer impact > 15 minutes | Yes |
| Near-miss (almost became P1) | Yes |
| P3 with no user impact | Team decision |
| P4 (Low)—cosmetic issues | No |
The ideal time for a PIR is 24-48 hours after recovery. Any sooner and the team is still stressed and the details haven't been gathered yet. Any later and people forget nuances, logs rotate, and Slack threads get archived.
Who Should Attend
Getting the right attendees is half the battle. Too few people leave gaps in the picture. Too many turn the meeting into a presentation.
Required Attendees
The incident commander (or whoever coordinated the response). All engineers who directly participated in mitigation. The engineering lead for the affected service. If multiple teams were involved, the lead from each.
Optional Attendees
Product manager—for business impact context. Customer support lead—for understanding how to improve customer communication. SRE/platform team—if the incident exposed systemic issues (missing monitoring, infrastructure weakness).
Ideal size: 4-8 people. More than 10 and productivity drops. Others can read the document afterward.
PIR Meeting Structure
The meeting takes 30-60 minutes. Complex P1s involving multiple teams might need 90 minutes, but no longer. Here's the four-part structure:
1. Timeline Reconstruction (10-15 min)
Rebuild the sequence of events from first symptom to full recovery. Sources: monitoring data (when alerts fired, when detection happened), Slack logs, git history, deployment logs, status page updates.
Critical timeline points: when the problem actually started (not when discovered), when monitoring detected it, how long until first response, when active troubleshooting began, when root cause was found, when the fix was applied, when recovery was confirmed.
With multi-region monitoring, detection and recovery timestamps come directly from the incident timeline—no need to reconstruct from memory.
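Mechanically, timeline reconstruction is just a merge-and-sort over event exports from each source. A minimal sketch; the `Event` shape, source names, and sample events are assumptions, not any tool's real export format:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Event:
    ts: datetime
    source: str   # e.g. "monitoring", "slack", "deploy"
    text: str

def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from every source into one chronological timeline."""
    merged = [e for events in sources for e in events]
    return sorted(merged, key=lambda e: e.ts)

# Hypothetical events pulled from exports of each system
monitoring = [Event(datetime(2026, 3, 10, 14, 7, tzinfo=timezone.utc),
                    "monitoring", "Alert fired: 5xx rate above threshold")]
deploys = [Event(datetime(2026, 3, 10, 14, 0, tzinfo=timezone.utc),
                 "deploy", "v2.3.1 rollout started")]

for e in build_timeline(monitoring, deploys):
    print(f"{e.ts:%H:%M} [{e.source}] {e.text}")
```

Dumping this into the PIR document before the meeting means the group aligns on facts instead of reconstructing them from memory.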
2. Root Cause Analysis (15-20 min)
The "5 Whys" technique is a proven way to get to root cause. Each "why" should dig deeper, from symptom to systemic cause.
Problem: API returned 500 errors for 47 minutes
Why 1: Why was the API returning 500?
→ Database connection pool was exhausted.
Why 2: Why was the connection pool exhausted?
→ One slow query held connections for 30 seconds instead of 200ms.
Why 3: Why did the query become slow?
→ Yesterday we deployed a migration that dropped an index.
Why 4: Why did the migration drop the index?
→ Django auto-generated the migration with DROP INDEX,
and nobody caught it in code review.
Why 5: Why wasn't this caught?
→ No automated check for destructive migration operations.
The review checklist doesn't include SQL plan validation.
Root cause: Missing automated migration review in CI pipeline. Notice: root cause is not "someone deployed a bad migration." Root cause is the missing automated check. That's a systemic problem you can fix.
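The fix from Why 5 can start as small as a lint step in CI that fails on destructive SQL. A sketch under assumptions: the `-- allow-destructive` override marker is an invented convention, and a real checker would inspect the query plan, not just grep the SQL:

```python
import re
import sys
from pathlib import Path

# Operations we refuse to merge without an explicit acknowledgment comment.
DESTRUCTIVE = re.compile(r"\b(DROP\s+(INDEX|TABLE|COLUMN)|TRUNCATE)\b", re.IGNORECASE)
OVERRIDE = "-- allow-destructive"

def check_migration(sql: str) -> list[str]:
    """Return offending lines unless explicitly acknowledged inline."""
    return [
        line.strip()
        for line in sql.splitlines()
        if DESTRUCTIVE.search(line) and OVERRIDE not in line
    ]

if __name__ == "__main__":
    failures = []
    for path in sys.argv[1:]:
        for line in check_migration(Path(path).read_text()):
            failures.append(f"{path}: {line}")
    if failures:
        print("Destructive migration operations found:")
        print("\n".join(failures))
        sys.exit(1)
```

Run it in CI against the SQL produced by `sqlmigrate` (or your migration tool's equivalent), and the Why-4 scenario fails the build instead of reaching production.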
3. What Went Well / What Didn't (10 min)
Don't skip the positive part. If detection time was 2 minutes instead of the usual 10—that's a win. Understand what worked so you can replicate it. If rollback completed in 3 minutes, document which processes made that possible.
What didn't work: runbook was outdated, escalation went to the wrong person, monitoring didn't cover the failed endpoint, rollback required manual steps instead of being automated. Each becomes a potential action item.
4. Action Items (10 min)
Every action item needs three attributes: description, owner, and deadline. "Improve monitoring" is not an action item. "Add alert when connection pool utilization exceeds 80%—@alex—by April 15" is an action item.
Action item categories: prevent (stop it happening again), detect (catch it faster), respond (handle it better), mitigate (reduce impact).
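The owner/deadline/category rule is easy to enforce mechanically when action items are filed. A minimal sketch; the field names are assumptions to adapt to your tracker:

```python
from dataclasses import dataclass
from datetime import date

CATEGORIES = {"prevent", "detect", "respond", "mitigate"}

@dataclass
class ActionItem:
    description: str
    owner: str      # e.g. "@alex"; never empty
    deadline: date
    category: str   # one of CATEGORIES

def validate(item: ActionItem) -> list[str]:
    """Return the ways an action item falls short of the three-attribute rule."""
    problems = []
    if not item.description.strip():
        problems.append("missing description")
    if not item.owner.strip():
        problems.append("no owner assigned")
    if item.deadline <= date.today():
        problems.append("deadline is not in the future")
    if item.category not in CATEGORIES:
        problems.append(f"unknown category: {item.category!r}")
    return problems
```

Running a check like this at the end of the meeting catches "improve monitoring"-style items before the document is finalized.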
Post-Incident Review Template
A ready-to-use template for Notion, Confluence, Google Docs, or any wiki. Fill it in during or immediately after the PIR meeting.
# Post-Incident Review: [Incident Name]
## Summary
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes
- **Severity**: P1 / P2 / P3
- **Incident Commander**: @name
- **Author**: @name
- **Status**: Draft / Reviewed / Final
## Impact
- Users affected: [number or percentage]
- Revenue impact: [if applicable]
- SLA impact: [was SLA breached, which one]
- Customer communications sent: [yes/no, how many]
## Timeline (UTC)
| Time | Event |
|-------|------------------------------------------|
| 14:00 | Deployment of v2.3.1 started |
| 14:05 | First 500 errors in monitoring |
| 14:07 | Alert fired, on-call paged |
| 14:12 | IC confirmed, incident channel opened |
| 14:25 | Root cause identified: missing index |
| 14:30 | Rollback initiated |
| 14:33 | Rollback complete, errors clearing |
| 14:45 | Monitoring confirms full recovery |
## Detection
- How was the incident detected? [monitoring alert / user report / internal]
- Time from start to detection: [X minutes]
- Time from detection to first response: [X minutes]
## Root Cause
[5 Whys analysis or detailed explanation]
## Contributing Factors
- [Factor 1: what made it worse]
- [Factor 2: what delayed recovery]
## What Went Well
- [What worked: quick detection, effective rollback, etc.]
## What Didn't Go Well
- [What failed: outdated runbook, slow escalation, etc.]
## Action Items
| # | Action | Owner | Deadline | Status |
|---|---------------------------------|--------|------------|--------|
| 1 | Add migration lint to CI | @alex | 2026-04-01 | Open |
| 2 | Set up connection pool alerts | @maria | 2026-03-28 | Open |
| 3 | Update rollback runbook | @ivan | 2026-04-05 | Open |
| 4 | Add canary deployment stage | @alex | 2026-04-15 | Open |
## Lessons Learned
[2-3 key takeaways]
Action Items: From Document to Results
Action items are the most valuable part of a PIR—and the most fragile. According to Google SRE, around 40% of post-mortem action items don't get completed on time without active tracking.
Rules for Effective Action Items
Every action item should be SMART: specific (not "improve monitoring," but "add an alert"), measurable (a clear definition of done), assigned to an owner, realistic in scope, and time-bound with a deadline.
Log action items in your regular issue tracker—Jira, Linear, GitHub Issues. If they live only in the PIR document, they won't be seen during sprint planning and won't get prioritized. Use a label (like pir-action) to track and filter them.
Review open PIR action items weekly—a 5-minute block in standup or a section in your SRE sync. If an action item stalls for two weeks, escalate to the engineering manager.
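The two-week escalation rule can be a small scheduled script over a tracker export. A sketch; the issue-dict shape below is an assumption, not any tracker's real API:

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=14)

def stale_pir_items(issues: list[dict], now: datetime) -> list[dict]:
    """Return open pir-action issues untouched for two weeks: escalation candidates.

    Expects issue dicts shaped like a tracker export, with
    "state", "labels", and "updated_at" keys.
    """
    return [
        i for i in issues
        if i["state"] == "open"
        and "pir-action" in i["labels"]
        and now - i["updated_at"] > STALE_AFTER
    ]
```

Pipe the result into the channel where your weekly review happens, and stalled items surface automatically instead of depending on someone remembering to check.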
How Monitoring Data Helps PIRs
Quality monitoring turns a PIR from "we think it was like this" into "the data shows exactly what happened." Key metrics for PIRs:
Time to Detect (TTD)—from problem start to alert. If TTD exceeds 5 minutes on P1, you need more health checks or faster check intervals.
Time to Acknowledge (TTA)—from alert to first response. If TTA exceeds 15 minutes, there's a problem in your on-call rotation or notification routing.
Time to Mitigate (TTM)—from first response to recovery. If TTM exceeds 30 minutes, you need better runbooks or automated rollback.
Recovery Confirmation—monitoring data confirms the service actually recovered, not just "seems to work." Multi-region checks provide objective confirmation.
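All three metrics fall out of four timestamps on the timeline. A sketch using the example timeline from the template above (the dates themselves are hypothetical):

```python
from datetime import datetime, timedelta

def incident_metrics(started: datetime, detected: datetime,
                     acknowledged: datetime, recovered: datetime) -> dict[str, timedelta]:
    """Derive TTD, TTA, and TTM from four incident-timeline timestamps."""
    return {
        "TTD": detected - started,        # time to detect
        "TTA": acknowledged - detected,   # time to acknowledge
        "TTM": recovered - acknowledged,  # time to mitigate
    }

# 14:05 first errors, 14:07 alert, 14:12 IC confirmed, 14:45 recovery confirmed
m = incident_metrics(
    started=datetime(2026, 3, 10, 14, 5),
    detected=datetime(2026, 3, 10, 14, 7),
    acknowledged=datetime(2026, 3, 10, 14, 12),
    recovered=datetime(2026, 3, 10, 14, 45),
)
```

Tracking these three numbers across PIRs tells you whether your action items are actually moving detection and response in the right direction.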
Common PIR Mistakes
Blaming Individuals Instead of Systems
If an engineer leaves the PIR feeling blamed, the review failed. Next time, that engineer will hide context and root cause analysis will be incomplete. Facilitators should redirect: "We're not discussing who made a mistake. We're discussing why the system allowed that mistake."
No Follow-Through on Action Items
The most common failure. You run a great PIR, write solid action items, and a month later none are done. A PIR without execution is wasted time. If you're not willing to track action items, don't run a PIR at all—at least you won't have the false confidence that "we figured it out."
PIR Document as a Novel
Nobody reads a 15-page document. An effective PIR doc is 2-3 pages max. Timeline as a table, not prose. Root cause as 5 Whys, not a dissertation. Action items as a table with owner and deadline. Everything else goes in an appendix or as links to Slack threads and dashboards.
Skipping PIR for "Simple" Incidents
"We just restarted the server and it worked." Without a PIR, you don't know why the restart was needed. Memory leak? OOM killer? Zombie processes? First time is luck. Second time is a pattern. Third time is a systemic problem you don't understand because you never investigated.
Action Items Without Priorities
10 action items with equal priority means 0 action items will get done. Divide them: "prevent" (stop repeat, highest priority), "detect" (catch faster, medium), "improve" (process improvements, can wait). The top 2-3 "prevent" items must be done before the next sprint review.
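The bucket ordering can be encoded directly wherever action items are listed, so "prevent" work always surfaces first at planning. A minimal sketch with hypothetical items:

```python
# Priority buckets from lowest number (do first) to highest (can wait)
PRIORITY = {"prevent": 0, "detect": 1, "improve": 2}

def triage(items: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Stable-sort (category, description) pairs; unknown categories sink last."""
    return sorted(items, key=lambda it: PRIORITY.get(it[0], len(PRIORITY)))
```

Usage: `triage([("improve", "refresh runbook"), ("prevent", "migration lint")])` puts the migration lint first.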
Pre-PIR Checklist
1. Timeline draft is ready (from monitoring data + Slack logs)
2. All participants are invited, no scheduling conflicts
3. Facilitator assigned (not the incident commander—need fresh eyes)
4. PIR document template created and pre-filled
5. Access to dashboards, logs, and deployment history ready for participants
6. Severity and impact pre-assessed
7. Blameless approach mentioned in the invite
PIR as a Team Learning Tool
Teams that run regular PIRs don't just fail less—they grow faster. Every PIR gives engineers deeper understanding of the systems they build. A junior developer who participates in a P1 incident analysis learns more about production reality in one hour than from a month of reading docs.
Over time, PIR documents become a knowledge library. New team members read the last 5-10 PIRs and get the real story: which services are problematic, which failure patterns repeat, how the team solves problems. That's context you don't get from README files or architecture diagrams.
Related Resources
Incident Management A to Z — complete guide to managing incidents: from detection to closure.
Incident Severity Levels: P1-P4 Classification Guide — how to properly classify incidents.
Incident Communication Templates — templates for during-incident communication: status page, email, Slack.
On-Call Rotation Best Practices — building sustainable on-call without burnout.
FAQ
What is a post-incident review (PIR)?
A post-incident review is a structured meeting held after a significant incident to analyze what happened, why it happened, and how to prevent recurrence. Unlike blame-focused investigations, modern PIRs follow a blameless approach: they focus on systemic causes rather than individual mistakes. The output is a document with timeline, root cause analysis, contributing factors, and actionable follow-ups with owners and deadlines.
When should you conduct a post-incident review?
Run a PIR for every P1 and P2 incident. For P3 incidents, run one if customer impact exceeded 15 minutes or if the incident revealed a systemic weakness. Also consider a PIR for near-misses that could have escalated. Schedule the review 24-48 hours after resolution: soon enough that details are fresh, but late enough that the team has recovered from incident stress.
What is a blameless post-mortem?
A blameless post-mortem assumes that people acted with the best information available at the time. Instead of asking 'who caused this?', it asks 'what conditions allowed this to happen?' and 'what systemic changes prevent recurrence?' This doesn't mean no accountability—it means accountability shifts from individuals to systems. Engineers who deployed bad code aren't blamed; instead, the review asks why CI/CD didn't catch the issue, why rollback was slow, or why monitoring didn't detect the regression.
How long should a post-incident review meeting take?
30-60 minutes for most incidents. P1 incidents with complex timelines may need 90 minutes. If your PIR regularly exceeds 60 minutes, you're likely going too deep during the meeting itself. Focus the meeting on timeline alignment, root cause discussion, and action item assignment. Detailed writeup can happen asynchronously after the meeting. Never let a PIR run past 90 minutes—diminishing returns are severe.
Who should attend a post-incident review?
Required: incident commander, all responders who participated, engineering lead for the affected system. Optional: product manager (for customer impact context), customer support lead (for user communication learnings), SRE/platform team (for systemic patterns). Keep it under 10 people. Observers can read the document afterward. If someone wasn't involved in the incident or doesn't own a contributing system, they don't need to be in the room.
How do you ensure action items from PIRs actually get completed?
Three practices: (1) Every action item gets an owner and a deadline during the meeting—no unassigned items. (2) Track PIR action items in the same issue tracker as regular work (Jira, Linear, GitHub Issues)—not in a separate document that nobody checks. (3) Review open PIR action items weekly in team standup or SRE review. If action items consistently stall, escalate to engineering management. Incomplete PIR actions are a leading indicator of repeat incidents.