An incident has happened, monitoring has fired, and the on-call engineer is on it. Now you need to communicate with customers, and do it right. Poor communication turns a 15-minute outage into a trust crisis; good communication demonstrates professionalism and builds loyalty.
Below are ready-to-use templates for each incident stage, from detection to post-incident review. Adapt them to your product and put them to use.
Four Incident Stages
1. Investigating — problem detected, cause unknown. First message.
2. Identified — cause found, fix in progress.
3. Monitoring — fix deployed, watching recovery.
4. Resolved — issue resolved, service restored.
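These four stages form a strict forward progression, and a simple guard can keep status updates from ever moving backwards. A minimal sketch in Python (the class and method names are illustrative, not any particular status-page product's API):

```python
# Illustrative sketch: the four incident stages as a forward-only progression.
# Names here are hypothetical, not a specific status-page product's API.
INCIDENT_STAGES = ["investigating", "identified", "monitoring", "resolved"]

class Incident:
    def __init__(self, component):
        self.component = component
        self.stage = "investigating"  # every incident starts here
        self.updates = []

    def advance(self, stage, message):
        """Move to a later stage. Skipping ahead (e.g. straight to
        'resolved' after a quick fix) is allowed; moving backwards is not."""
        if INCIDENT_STAGES.index(stage) <= INCIDENT_STAGES.index(self.stage):
            raise ValueError(f"cannot go from {self.stage} back to {stage}")
        self.stage = stage
        self.updates.append((stage, message))

inc = Incident("API")
inc.advance("identified", "A configuration change caused some API requests to fail.")
inc.advance("monitoring", "Fix deployed; watching recovery.")
inc.advance("resolved", "All services are operating normally.")
```

Each `advance` call corresponds to one status page post, so the update history doubles as the incident timeline.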
Status Page Templates
Stage 1: Investigating
Title: [Component] — Degraded Performance / Partial Outage
When: Within 5 minutes of detection
We are investigating reports of [increased error rates / slow response times /
unavailability] affecting [Component Name].
Some users may experience [specific impact: failed API calls / slow page loads /
inability to log in].
Our team is actively investigating. We will provide an update within 30 minutes.
Posted at: [timestamp]
Key elements: what's affected, the impact on users, and when the next update is coming. Don't name the cause if you're unsure.
Stage 2: Identified
We have identified the cause of [the issue / degraded performance] affecting
[Component Name].
[Brief, non-technical explanation: A database issue is causing delays in
processing requests / A configuration change caused some API requests to fail].
Our team is implementing a fix. We expect service to be restored within
[estimated time].
Affected: [list of affected capabilities]
Not affected: [list of unaffected capabilities — helps reassure users]
Posted at: [timestamp]
Key elements: the cause (high-level), an ETA, and what's not affected (reduces panic).
Stage 3: Monitoring
A fix has been implemented for the [Component Name] issue. We are monitoring
the situation to confirm full recovery.
[Most / All] users should now be able to [access the service / process
requests] normally.
If you continue to experience issues, please [clear your cache / try again
in a few minutes / contact support].
We will post a final update once we confirm the issue is fully resolved.
Posted at: [timestamp]
Stage 4: Resolved
The issue affecting [Component Name] has been resolved. All services are
operating normally.
Summary:
- Duration: [start time] to [end time] ([X] minutes)
- Impact: [brief description of what users experienced]
- Cause: [one-line root cause]
- Resolution: [one-line fix description]
We apologize for any inconvenience. A detailed post-incident review will
be published within [24/48] hours.
Posted at: [timestamp]
Email Notification Templates
Incident Notification (to Subscribers)
Subject: [Service Name] — [Component] experiencing issues
Hi,
We're currently experiencing [brief description] affecting [Component].
What's happening: [1-2 sentences about the impact]
What we're doing: Our team is investigating and working on a resolution.
Current status: [Investigating / Identified / Monitoring]
You can follow live updates on our status page: [status page URL]
We'll send another update when the situation changes.
— [Your Company] Team
Resolution Notification
Subject: [Resolved] [Service Name] — [Component] issue resolved
Hi,
The issue affecting [Component] has been resolved. Services are back
to normal.
Duration: [X] minutes ([start] to [end])
Impact: [what users experienced]
Resolution: [brief non-technical explanation]
We apologize for any disruption. If you have questions, please contact
our support team at [email/link].
A detailed post-incident report will be available at: [link]
— [Your Company] Team
Internal Communication Templates
Slack: Incident Declared
🔴 INCIDENT — [P1/P2] — [Component] [down/degraded]
Impact: [what's broken, who's affected]
Detection: [how it was found — AtomPing alert / customer report / internal]
IC (Incident Commander): @[name]
Status page: [link]
War room: [Slack channel / Zoom link]
@on-call — please acknowledge
Slack: Periodic Update
🟡 UPDATE — [Component] incident — [HH:MM]
Current status: [Identified / fix in progress]
Root cause: [technical explanation for the team]
Next steps: [what's being done now]
ETA: [estimated resolution time]
Blockers: [if any]
Next update in [15/30] minutes
Post-Incident Summary Template
POST-INCIDENT REVIEW — [Date] — [Component]
Duration: [start] to [end] ([X] minutes)
Severity: [P1/P2/P3]
Impact: [number of affected users/requests, revenue impact if applicable]
Detection: [how and when — AtomPing alert at HH:MM, customer report, etc.]
Time to detect: [X] minutes
Time to resolve: [X] minutes
TIMELINE:
[HH:MM] — AtomPing alert: [Component] HTTP check failed
[HH:MM] — On-call acknowledged, began investigation
[HH:MM] — Root cause identified: [description]
[HH:MM] — Fix deployed
[HH:MM] — Service confirmed restored
[HH:MM] — Status page updated to Resolved
ROOT CAUSE:
[2-3 paragraphs explaining what went wrong technically]
WHAT WENT WELL:
- [Detection was fast — 30s via AtomPing]
- [Runbook existed and was followed]
- [Status page kept customers informed]
WHAT COULD BE IMPROVED:
- [Detection: monitor X endpoint was missing]
- [Response: escalation was delayed by Y minutes]
- [Communication: first status page update was late]
ACTION ITEMS:
- [ ] [Specific fix to prevent recurrence] — Owner: @name — Due: [date]
- [ ] [Monitoring improvement] — Owner: @name — Due: [date]
- [ ] [Process improvement] — Owner: @name — Due: [date]
Rules for Good Incident Communication
Speed over completeness: an "investigating" post within 5 minutes beats a detailed message after 30 minutes of silence.
Specificity over vagueness: "the API is returning 503 errors" beats "some users may experience issues".
Specify what's not affected: "the dashboard and API are unaffected" reduces panic. Customers assume the worst case until told otherwise.
ETA with a caveat: "we expect resolution within 30 minutes" beats "we're working on it". Add "we will update this estimate if it changes".
Don't assign blame: "a configuration change caused..." not "an engineer accidentally...". Blameless culture starts with public communication.
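Bracketed templates like the ones above can also be filled programmatically, which makes it easy to enforce the rules (always name the component, always state the impact, always promise the next update). A sketch using plain `str.format`; the field names and function are illustrative, not part of any tool:

```python
# Illustrative: render the Stage 1 "Investigating" update from a template.
# Field names are placeholders matching the bracketed slots above.
INVESTIGATING_TEMPLATE = (
    "We are investigating reports of {symptom} affecting {component}.\n"
    "Some users may experience {impact}.\n"
    "Our team is actively investigating. "
    "We will provide an update within {next_update_minutes} minutes."
)

def render_investigating(component, symptom, impact, next_update_minutes=30):
    """Fill the template; every required slot must be supplied, so a
    vague, slot-free update can't be posted by accident."""
    return INVESTIGATING_TEMPLATE.format(
        component=component,
        symptom=symptom,
        impact=impact,
        next_update_minutes=next_update_minutes,
    )

update = render_investigating(
    "the API", "increased error rates", "failed API calls"
)
```

Because `str.format` raises `KeyError` on a missing slot, an incomplete update fails loudly instead of going out half-filled.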
Automation with AtomPing
Auto-detection: monitoring detects the outage and automatically opens an incident on the status page, with zero delay between detection and the first update.
Auto-resolution: monitoring confirms recovery and the incident is automatically resolved; the status page updates on its own.
Manual details: automation sets the status to "investigating"; an engineer then adds the details: cause, ETA, and actions. This gives you the best of both: speed plus human context.
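The glue between monitoring and the status page can be a small webhook handler. A sketch under stated assumptions: the alert payload fields, the endpoint URL, and the incident-update shape below are placeholders for illustration, not AtomPing's documented contract:

```python
# Hypothetical glue between a monitoring webhook and a status-page API.
# The payload fields, endpoint, and body shape are illustrative assumptions,
# not AtomPing's actual contract.
import json
from urllib import request

STATUS_PAGE_API = "https://status.example.com/api/incidents"  # placeholder

def incident_update_for(alert: dict) -> dict:
    """Map a monitoring alert to a status-page incident update."""
    if alert["state"] == "down":
        # Auto-detection: open the incident the moment the check fails.
        return {
            "component": alert["component"],
            "status": "investigating",  # engineer adds cause and ETA later
            "message": f"We are investigating an issue affecting {alert['component']}.",
        }
    # Auto-resolution: the check passes again, so close the incident out.
    return {
        "component": alert["component"],
        "status": "resolved",
        "message": "The issue has been resolved. All services are operating normally.",
    }

def post_update(update: dict) -> None:
    """POST the update to the status page (endpoint is a placeholder)."""
    req = request.Request(
        STATUS_PAGE_API,
        data=json.dumps(update).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    request.urlopen(req)  # would fail here: placeholder endpoint
```

The mapping function is pure, so the down → "investigating" and up → "resolved" logic can be unit-tested without touching the network.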
Related Resources
Incident Management Guide — full lifecycle from detection to post-mortem
Complete Status Pages Guide — components, design, infrastructure
On-Call Best Practices — escalation policies and alert routing
Public vs Internal Status Pages — what to show each audience
FAQ
How quickly should I post the first status page update?
Within 5 minutes of detection. The first update doesn't need root cause — just acknowledge the problem: what's affected, what you know so far, and that you're investigating. Silence during an outage is worse than incomplete information.
How often should I update the status page during an incident?
Every 15-30 minutes for active incidents. Even if nothing changed, post 'Still investigating, no new information' — it shows you're working on it. For P1 incidents with high visibility, update every 10-15 minutes.
What tone should I use in incident communications?
Professional, empathetic, and direct. Acknowledge the impact ('We understand this affects your workflow'), be specific about what's broken, avoid assigning blame (say 'a third-party issue' rather than pointing at a specific provider), and provide clear next steps. Never use corporate jargon like 'synergies' or 'leveraging' in crisis communication.
Should I explain the technical root cause to customers?
Keep it high-level on the public status page: 'A database issue is causing delays in order processing.' Save technical details (OOMKilled, connection pool exhaustion, replication lag) for the internal status page and post-mortem. Customers care about impact and resolution, not implementation details.
What's the difference between 'investigating' and 'identified'?
Investigating: you know something is wrong but don't know why. Identified: you found the root cause and are working on a fix. Monitoring: fix deployed, watching for confirmation. Resolved: confirmed fixed, service back to normal. These four stages give customers a clear progression.
Should I send a post-incident summary to customers?
Yes, for any incident lasting more than 15 minutes or affecting a significant portion of users. Include: what happened, timeline, root cause (simplified), what you did to fix it, and what you're doing to prevent recurrence. This builds trust and shows accountability.