The incident has happened. Monitoring fired. The on-call engineer is on it. Now you need to tell your customers, and you need to do it right. Poor communication turns a 15-minute outage into a crisis of trust; good communication demonstrates professionalism and builds loyalty.
Below are ready-made templates for every stage of an incident, from first detection to the post-incident review. Adapt them to your product and use them.
The four stages of an incident
1. Investigating: the problem has been detected, the cause is unknown. This is the first message.
2. Identified: the cause has been found, a fix is in progress.
3. Monitoring: the fix has been applied, watching for recovery.
4. Resolved: the problem is fixed and service is restored.
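The four stages form a simple state machine, and it helps to enforce which transitions are legal before posting an update. A minimal sketch in Python; the fallback transitions (a fix that did not hold, a wrong diagnosis) are my assumptions, not any specific tool's behavior:

```python
from enum import Enum

class Stage(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"

# Forward progression, plus fallbacks: a fix that did not hold sends
# MONITORING back to INVESTIGATING; a wrong diagnosis does the same
# for IDENTIFIED. RESOLVED is terminal.
ALLOWED = {
    Stage.INVESTIGATING: {Stage.IDENTIFIED, Stage.RESOLVED},
    Stage.IDENTIFIED: {Stage.MONITORING, Stage.INVESTIGATING},
    Stage.MONITORING: {Stage.RESOLVED, Stage.INVESTIGATING},
    Stage.RESOLVED: set(),
}

def advance(current: Stage, target: Stage) -> Stage:
    """Validate a stage transition before posting the status page update."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Guarding transitions this way prevents an incident from silently skipping backwards (e.g. Resolved back to Investigating) without anyone noticing on the status page.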
Status page templates
Stage 1: Investigating
Title: [Component] — Degraded Performance / Partial Outage
When: within 5 minutes of detection
We are investigating reports of [increased error rates / slow response times /
unavailability] affecting [Component Name].
Some users may experience [specific impact: failed API calls / slow page loads /
inability to log in].
Our team is actively investigating. We will provide an update within 30 minutes.
Posted at: [timestamp]
Key elements: what is affected, how users are impacted, and when the next update will come. Do not name a cause unless you are sure of it.
Stage 2: Identified
We have identified the cause of [the issue / degraded performance] affecting
[Component Name].
[Brief, non-technical explanation: A database issue is causing delays in
processing requests / A configuration change caused some API requests to fail].
Our team is implementing a fix. We expect service to be restored within
[estimated time].
Affected: [list of affected capabilities]
Not affected: [list of unaffected capabilities — helps reassure users]
Posted at: [timestamp]
Key elements: the cause (high-level), an ETA, and what is NOT affected (this reduces anxiety).
Stage 3: Monitoring
A fix has been implemented for the [Component Name] issue. We are monitoring
the situation to confirm full recovery.
[Most / All] users should now be able to [access the service / process
requests] normally.
If you continue to experience issues, please [clear your cache / try again
in a few minutes / contact support].
We will post a final update once we confirm the issue is fully resolved.
Posted at: [timestamp]
Stage 4: Resolved
The issue affecting [Component Name] has been resolved. All services are
operating normally.
Summary:
- Duration: [start time] to [end time] ([X] minutes)
- Impact: [brief description of what users experienced]
- Cause: [one-line root cause]
- Resolution: [one-line fix description]
We apologize for any inconvenience. A detailed post-incident review will
be published within [24/48] hours.
Posted at: [timestamp]
Email notification templates
Incident notification (to subscribers)
Subject: [Service Name] — [Component] experiencing issues
Hi,
We're currently experiencing [brief description] affecting [Component].
What's happening: [1-2 sentences about the impact]
What we're doing: Our team is investigating and working on a resolution.
Current status: [Investigating / Identified / Monitoring]
You can follow live updates on our status page: [status page URL]
We'll send another update when the situation changes.
— [Your Company] Team
Resolution notification
Subject: [Resolved] [Service Name] — [Component] issue resolved
Hi,
The issue affecting [Component] has been resolved. Services are back
to normal.
Duration: [X] minutes ([start] to [end])
Impact: [what users experienced]
Resolution: [brief non-technical explanation]
We apologize for any disruption. If you have questions, please contact
our support team at [email/link].
A detailed post-incident report will be available at: [link]
— [Your Company] Team
Internal communication templates
Slack: incident declared
🔴 INCIDENT — [P1/P2] — [Component] [down/degraded]
Impact: [what's broken, who's affected]
Detection: [how it was found — AtomPing alert / customer report / internal]
IC (Incident Commander): @[name]
Status page: [link]
War room: [Slack channel / Zoom link]
@on-call — please acknowledge
Slack: periodic update
🟡 UPDATE — [Component] incident — [HH:MM]
Current status: [Identified / fix in progress]
Root cause: [technical explanation for the team]
Next steps: [what's being done now]
ETA: [estimated resolution time]
Blockers: [if any]
Next update in [15/30] minutes
Post-incident summary template
POST-INCIDENT REVIEW — [Date] — [Component]
Duration: [start] to [end] ([X] minutes)
Severity: [P1/P2/P3]
Impact: [number of affected users/requests, revenue impact if applicable]
Detection: [how and when — AtomPing alert at HH:MM, customer report, etc.]
Time to detect: [X] minutes
Time to resolve: [X] minutes
TIMELINE:
[HH:MM] — AtomPing alert: [Component] HTTP check failed
[HH:MM] — On-call acknowledged, began investigation
[HH:MM] — Root cause identified: [description]
[HH:MM] — Fix deployed
[HH:MM] — Service confirmed restored
[HH:MM] — Status page updated to Resolved
ROOT CAUSE:
[2-3 paragraphs explaining what went wrong technically]
WHAT WENT WELL:
- [Detection was fast — 30s via AtomPing]
- [Runbook existed and was followed]
- [Status page kept customers informed]
WHAT COULD BE IMPROVED:
- [Detection: monitor X endpoint was missing]
- [Response: escalation was delayed by Y minutes]
- [Communication: first status page update was late]
ACTION ITEMS:
- [ ] [Specific fix to prevent recurrence] — Owner: @name — Due: [date]
- [ ] [Monitoring improvement] — Owner: @name — Due: [date]
- [ ] [Process improvement] — Owner: @name — Due: [date]
Rules of good incident communication
Speed beats completeness: a first message within 5 minutes that says "investigating" is better than a detailed message after 30 minutes of silence.
Specifics beat vagueness: "API requests returning 503 errors" is better than "some users may experience issues".
Say what is NOT affected: "Dashboard and API are unaffected" reduces panic. Customers assume the worst case until you tell them otherwise.
Give an ETA, with a caveat: "We expect resolution within 30 minutes" is better than "working on it". But add "we will update if this changes".
Don't assign blame: "A configuration change caused..." rather than "An engineer accidentally...". Blameless culture starts with public communication.
Automation with AtomPing
Auto-detection: monitoring detects the outage and automatically opens an incident on the status page. Zero delay between detection and the first update.
Auto-resolution: monitoring confirms recovery and the incident is automatically resolved. The status page updates itself.
Manual details: automation posts "investigating"; an engineer then adds the details: cause, ETA, action items. The best of both worlds: automated speed plus human context.
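The auto-detection and auto-resolution flow can be sketched as a small webhook handler. The payload field names and the status page endpoint below are assumptions for illustration, not AtomPing's actual schema; substitute your status page provider's real API:

```python
import json
import urllib.request

STATUS_API = "https://status.example.com/api/incidents"  # hypothetical endpoint

def incident_update_for(alert: dict) -> dict:
    """Map a monitoring webhook payload to a status page update body.

    `alert["check_name"]` and `alert["state"]` are assumed field names.
    """
    component = alert["check_name"]
    if alert["state"] == "down":
        return {
            "component": component,
            "status": "investigating",  # automation never guesses a cause
            "message": f"We are investigating issues affecting {component}.",
        }
    # Recovery confirmed by the monitor itself.
    return {
        "component": component,
        "status": "resolved",
        "message": f"The issue affecting {component} has been resolved.",
    }

def post_update(body: dict) -> None:
    """Send the update to the status page API (add auth and retries in practice)."""
    req = urllib.request.Request(
        STATUS_API,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

Note that the automated path only ever posts "investigating" or "resolved"; the Identified and Monitoring stages still come from the engineer, who edits the incident to add the cause and ETA.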
Related reading
Incident Management Guide: the full lifecycle from detection to post-mortem
The Complete Guide to Status Pages: components, design, infrastructure
On-Call Best Practices: escalation policies and alert routing
Public vs Internal Status Pages: what to show each audience
FAQ
How quickly should I post the first status page update?
Within 5 minutes of detection. The first update doesn't need root cause — just acknowledge the problem: what's affected, what you know so far, and that you're investigating. Silence during an outage is worse than incomplete information.
How often should I update the status page during an incident?
Every 15-30 minutes for active incidents. Even if nothing changed, post 'Still investigating, no new information' — it shows you're working on it. For P1 incidents with high visibility, update every 10-15 minutes.
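This cadence is easy to encode as the guard for an update-reminder bot. The thresholds below simply mirror the answer above and should be tuned to your own policy:

```python
# Maximum minutes between status page posts, per severity
# (mirrors the cadence described above; adjust to your policy).
CADENCE_MINUTES = {"P1": 15, "P2": 30, "P3": 30}

def update_overdue(severity: str, minutes_since_last_post: float) -> bool:
    """True when the incident commander should post a new update now,
    even if it is only 'Still investigating, no new information'."""
    return minutes_since_last_post >= CADENCE_MINUTES.get(severity, 30)
```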
What tone should I use in incident communications?
Professional, empathetic, and direct. Acknowledge impact ('We understand this affects your workflow'), be specific about what's broken, avoid blame ('our provider' vs 'a third-party issue'), and provide clear next steps. Never use corporate jargon like 'synergies' or 'leveraging' in crisis communication.
Should I explain the technical root cause to customers?
Keep it high-level on the public status page: 'A database issue is causing delays in order processing.' Save technical details (OOMKilled, connection pool exhaustion, replication lag) for the internal status page and post-mortem. Customers care about impact and resolution, not implementation details.
What's the difference between 'investigating' and 'identified'?
Investigating: you know something is wrong but don't know why. Identified: you found the root cause and are working on a fix. Monitoring: fix deployed, watching for confirmation. Resolved: confirmed fixed, service back to normal. These four stages give customers a clear progression.
Should I send a post-incident summary to customers?
Yes, for any incident lasting more than 15 minutes or affecting a significant portion of users. Include: what happened, timeline, root cause (simplified), what you did to fix it, and what you're doing to prevent recurrence. This builds trust and shows accountability.