An incident has happened, monitoring has fired, and the on-call engineer is on it. Now you need to communicate with customers, and do it right. Poor communication turns a 15-minute outage into a trust crisis; good communication demonstrates professionalism and builds loyalty.
Below are ready-to-use templates for each incident stage, from detection to post-incident review. Adapt them to your product and put them to use.
Four Incident Stages
1. Investigating — problem detected, cause unknown. First message.
2. Identified — cause found, fix in progress.
3. Monitoring — fix deployed, watching recovery.
4. Resolved — issue resolved, service restored.
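These four stages form a strict forward progression, and a simple guard can keep status updates from ever moving backwards. A minimal sketch in Python (the class and method names are illustrative, not any particular status-page product's API):

```python
# Illustrative sketch: the four incident stages as a forward-only progression.
# Names here are hypothetical, not a specific status-page product's API.
INCIDENT_STAGES = ["investigating", "identified", "monitoring", "resolved"]

class Incident:
    def __init__(self, component):
        self.component = component
        self.stage = "investigating"  # every incident starts here
        self.updates = []

    def advance(self, stage, message):
        """Move to a later stage. Skipping ahead (e.g. straight to
        'resolved' after a quick fix) is allowed; moving backwards is not."""
        if INCIDENT_STAGES.index(stage) <= INCIDENT_STAGES.index(self.stage):
            raise ValueError(f"cannot go from {self.stage} back to {stage}")
        self.stage = stage
        self.updates.append((stage, message))

inc = Incident("API")
inc.advance("identified", "A configuration change caused some API requests to fail.")
inc.advance("monitoring", "Fix deployed; watching recovery.")
inc.advance("resolved", "All services are operating normally.")
```

Each `advance` call corresponds to one status page post, so the update history doubles as the incident timeline.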
Status Page Templates
Stage 1: Investigating
Title: [Component] — Degraded Performance / Partial Outage
When: Within 5 minutes of detection
We are investigating reports of [increased error rates / slow response times /
unavailability] affecting [Component Name].
Some users may experience [specific impact: failed API calls / slow page loads /
inability to log in].
Our team is actively investigating. We will provide an update within 30 minutes.
Posted at: [timestamp]
Key elements: what's affected, the impact on users, and when the next update is coming. Don't name the cause if you're unsure.
Stage 2: Identified
We have identified the cause of [the issue / degraded performance] affecting
[Component Name].
[Brief, non-technical explanation: A database issue is causing delays in
processing requests / A configuration change caused some API requests to fail].
Our team is implementing a fix. We expect service to be restored within
[estimated time].
Affected: [list of affected capabilities]
Not affected: [list of unaffected capabilities — helps reassure users]
Posted at: [timestamp]
Key elements: the cause (high-level), an ETA, and what's not affected (reduces panic).
Stage 3: Monitoring
A fix has been implemented for the [Component Name] issue. We are monitoring
the situation to confirm full recovery.
[Most / All] users should now be able to [access the service / process
requests] normally.
If you continue to experience issues, please [clear your cache / try again
in a few minutes / contact support].
We will post a final update once we confirm the issue is fully resolved.
Posted at: [timestamp]
Stage 4: Resolved
The issue affecting [Component Name] has been resolved. All services are
operating normally.
Summary:
- Duration: [start time] to [end time] ([X] minutes)
- Impact: [brief description of what users experienced]
- Cause: [one-line root cause]
- Resolution: [one-line fix description]
We apologize for any inconvenience. A detailed post-incident review will
be published within [24/48] hours.
Posted at: [timestamp]
Email Notification Templates
Incident Notification (to Subscribers)
Subject: [Service Name] — [Component] experiencing issues
Hi,
We're currently experiencing [brief description] affecting [Component].
What's happening: [1-2 sentences about the impact]
What we're doing: Our team is investigating and working on a resolution.
Current status: [Investigating / Identified / Monitoring]
You can follow live updates on our status page: [status page URL]
We'll send another update when the situation changes.
— [Your Company] Team
Resolution Notification
Subject: [Resolved] [Service Name] — [Component] issue resolved
Hi,
The issue affecting [Component] has been resolved. Services are back
to normal.
Duration: [X] minutes ([start] to [end])
Impact: [what users experienced]
Resolution: [brief non-technical explanation]
We apologize for any disruption. If you have questions, please contact
our support team at [email/link].
A detailed post-incident report will be available at: [link]
— [Your Company] Team
Internal Communication Templates
Slack: Incident Declared
🔴 INCIDENT — [P1/P2] — [Component] [down/degraded]
Impact: [what's broken, who's affected]
Detection: [how it was found — AtomPing alert / customer report / internal]
IC (Incident Commander): @[name]
Status page: [link]
War room: [Slack channel / Zoom link]
@on-call — please acknowledge
Slack: Periodic Update
🟡 UPDATE — [Component] incident — [HH:MM]
Current status: [Identified / fix in progress]
Root cause: [technical explanation for the team]
Next steps: [what's being done now]
ETA: [estimated resolution time]
Blockers: [if any]
Next update in [15/30] minutes
Post-Incident Summary Template
POST-INCIDENT REVIEW — [Date] — [Component]
Duration: [start] to [end] ([X] minutes)
Severity: [P1/P2/P3]
Impact: [number of affected users/requests, revenue impact if applicable]
Detection: [how and when — AtomPing alert at HH:MM, customer report, etc.]
Time to detect: [X] minutes
Time to resolve: [X] minutes
TIMELINE:
[HH:MM] — AtomPing alert: [Component] HTTP check failed
[HH:MM] — On-call acknowledged, began investigation
[HH:MM] — Root cause identified: [description]
[HH:MM] — Fix deployed
[HH:MM] — Service confirmed restored
[HH:MM] — Status page updated to Resolved
ROOT CAUSE:
[2-3 paragraphs explaining what went wrong technically]
WHAT WENT WELL:
- [Detection was fast — 30s via AtomPing]
- [Runbook existed and was followed]
- [Status page kept customers informed]
WHAT COULD BE IMPROVED:
- [Detection: monitor X endpoint was missing]
- [Response: escalation was delayed by Y minutes]
- [Communication: first status page update was late]
ACTION ITEMS:
- [ ] [Specific fix to prevent recurrence] — Owner: @name — Due: [date]
- [ ] [Monitoring improvement] — Owner: @name — Due: [date]
- [ ] [Process improvement] — Owner: @name — Due: [date]
Rules for Good Incident Communication
Speed over completeness: an "investigating" post within 5 minutes beats a detailed message after 30 minutes of silence.
Specificity over vagueness: "the API is returning 503 errors" beats "some users may experience issues".
Specify what's not affected: "the dashboard and API are unaffected" reduces panic. Customers assume the worst case until told otherwise.
ETA with a caveat: "we expect resolution within 30 minutes" beats "we're working on it". Add "we will update this estimate if it changes".
Don't assign blame: "a configuration change caused..." not "an engineer accidentally...". Blameless culture starts with public communication.
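Bracketed templates like the ones above can also be filled programmatically, which makes it easy to enforce the rules (always name the component, always state the impact, always promise the next update). A sketch using plain `str.format`; the field names and function are illustrative, not part of any tool:

```python
# Illustrative: render the Stage 1 "Investigating" update from a template.
# Field names are placeholders matching the bracketed slots above.
INVESTIGATING_TEMPLATE = (
    "We are investigating reports of {symptom} affecting {component}.\n"
    "Some users may experience {impact}.\n"
    "Our team is actively investigating. "
    "We will provide an update within {next_update_minutes} minutes."
)

def render_investigating(component, symptom, impact, next_update_minutes=30):
    """Fill the template; every required slot must be supplied, so a
    vague, slot-free update can't be posted by accident."""
    return INVESTIGATING_TEMPLATE.format(
        component=component,
        symptom=symptom,
        impact=impact,
        next_update_minutes=next_update_minutes,
    )

update = render_investigating(
    "the API", "increased error rates", "failed API calls"
)
```

Because `str.format` raises `KeyError` on a missing slot, an incomplete update fails loudly instead of going out half-filled.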
Automation with AtomPing
Auto-detection: monitoring detects the outage and automatically opens an incident on the status page, with zero delay between detection and the first update.
Auto-resolution: monitoring confirms recovery and the incident is automatically resolved; the status page updates on its own.
Manual details: automation sets the status to "investigating"; an engineer then adds the details: cause, ETA, and actions. This gives you the best of both: speed plus human context.
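The glue between monitoring and the status page can be a small webhook handler. A sketch under stated assumptions: the alert payload fields, the endpoint URL, and the incident-update shape below are placeholders for illustration, not AtomPing's documented contract:

```python
# Hypothetical glue between a monitoring webhook and a status-page API.
# The payload fields, endpoint, and body shape are illustrative assumptions,
# not AtomPing's actual contract.
import json
from urllib import request

STATUS_PAGE_API = "https://status.example.com/api/incidents"  # placeholder

def incident_update_for(alert: dict) -> dict:
    """Map a monitoring alert to a status-page incident update."""
    if alert["state"] == "down":
        # Auto-detection: open the incident the moment the check fails.
        return {
            "component": alert["component"],
            "status": "investigating",  # engineer adds cause and ETA later
            "message": f"We are investigating an issue affecting {alert['component']}.",
        }
    # Auto-resolution: the check passes again, so close the incident out.
    return {
        "component": alert["component"],
        "status": "resolved",
        "message": "The issue has been resolved. All services are operating normally.",
    }

def post_update(update: dict) -> None:
    """POST the update to the status page (endpoint is a placeholder)."""
    req = request.Request(
        STATUS_PAGE_API,
        data=json.dumps(update).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    request.urlopen(req)  # would fail here: placeholder endpoint
```

The mapping function is pure, so the down → "investigating" and up → "resolved" logic can be unit-tested without touching the network.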
Related Resources
Incident Management Guide — full lifecycle from detection to post-mortem
Complete Status Pages Guide — components, design, infrastructure
On-Call Best Practices — escalation policies and alert routing
Public vs Internal Status Pages — what to show each audience
FAQ
How quickly should I post the first status page update?
Within 5 minutes of detection. The first update doesn't need root cause — just acknowledge the problem: what's affected, what you know so far, and that you're investigating. Silence during an outage is worse than incomplete information.
How often should I update the status page during an incident?
Every 15-30 minutes for active incidents. Even if nothing changed, post 'Still investigating, no new information' — it shows you're working on it. For P1 incidents with high visibility, update every 10-15 minutes.
What tone should I use in incident communications?
Professional, empathetic, and direct. Acknowledge the impact ('We understand this affects your workflow'), be specific about what's broken, avoid assigning blame (say 'a third-party issue' rather than pointing at a specific provider), and provide clear next steps. Never use corporate jargon like 'synergies' or 'leveraging' in crisis communication.
Should I explain the technical root cause to customers?
Keep it high-level on the public status page: 'A database issue is causing delays in order processing.' Save technical details (OOMKilled, connection pool exhaustion, replication lag) for the internal status page and post-mortem. Customers care about impact and resolution, not implementation details.
What's the difference between 'investigating' and 'identified'?
Investigating: you know something is wrong but don't know why. Identified: you found the root cause and are working on a fix. Monitoring: fix deployed, watching for confirmation. Resolved: confirmed fixed, service back to normal. These four stages give customers a clear progression.
Should I send a post-incident summary to customers?
Yes, for any incident lasting more than 15 minutes or affecting a significant portion of users. Include: what happened, timeline, root cause (simplified), what you did to fix it, and what you're doing to prevent recurrence. This builds trust and shows accountability.