The incident has happened. Monitoring fired. The on-call engineer is on it. Now you need to tell your customers, and you need to do it right. Poor communication turns a 15-minute outage into a crisis of trust; good communication demonstrates professionalism and builds loyalty.
Below are ready-made templates for every stage of an incident, from first detection to the post-incident review. Adapt them to your product and use them.
The four stages of an incident
1. Investigating: the problem has been detected, the cause is unknown. This is the first message.
2. Identified: the cause has been found, a fix is in progress.
3. Monitoring: the fix has been applied, watching for recovery.
4. Resolved: the problem is fixed and service is restored.
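The four stages form a simple state machine, and it helps to enforce which transitions are legal before posting an update. A minimal sketch in Python; the fallback transitions (a fix that did not hold, a wrong diagnosis) are my assumptions, not any specific tool's behavior:

```python
from enum import Enum

class Stage(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"

# Forward progression, plus fallbacks: a fix that did not hold sends
# MONITORING back to INVESTIGATING; a wrong diagnosis does the same
# for IDENTIFIED. RESOLVED is terminal.
ALLOWED = {
    Stage.INVESTIGATING: {Stage.IDENTIFIED, Stage.RESOLVED},
    Stage.IDENTIFIED: {Stage.MONITORING, Stage.INVESTIGATING},
    Stage.MONITORING: {Stage.RESOLVED, Stage.INVESTIGATING},
    Stage.RESOLVED: set(),
}

def advance(current: Stage, target: Stage) -> Stage:
    """Validate a stage transition before posting the status page update."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Guarding transitions this way prevents an incident from silently skipping backwards (e.g. Resolved back to Investigating) without anyone noticing on the status page.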
Status page templates
Stage 1: Investigating
Title: [Component] — Degraded Performance / Partial Outage
When: within 5 minutes of detection
We are investigating reports of [increased error rates / slow response times /
unavailability] affecting [Component Name].
Some users may experience [specific impact: failed API calls / slow page loads /
inability to log in].
Our team is actively investigating. We will provide an update within 30 minutes.
Posted at: [timestamp]
Key elements: what is affected, how users are impacted, and when the next update will come. Do not name a cause unless you are sure of it.
Stage 2: Identified
We have identified the cause of [the issue / degraded performance] affecting
[Component Name].
[Brief, non-technical explanation: A database issue is causing delays in
processing requests / A configuration change caused some API requests to fail].
Our team is implementing a fix. We expect service to be restored within
[estimated time].
Affected: [list of affected capabilities]
Not affected: [list of unaffected capabilities — helps reassure users]
Posted at: [timestamp]
Key elements: the cause (high-level), an ETA, and what is NOT affected (this reduces anxiety).
Stage 3: Monitoring
A fix has been implemented for the [Component Name] issue. We are monitoring
the situation to confirm full recovery.
[Most / All] users should now be able to [access the service / process
requests] normally.
If you continue to experience issues, please [clear your cache / try again
in a few minutes / contact support].
We will post a final update once we confirm the issue is fully resolved.
Posted at: [timestamp]
Stage 4: Resolved
The issue affecting [Component Name] has been resolved. All services are
operating normally.
Summary:
- Duration: [start time] to [end time] ([X] minutes)
- Impact: [brief description of what users experienced]
- Cause: [one-line root cause]
- Resolution: [one-line fix description]
We apologize for any inconvenience. A detailed post-incident review will
be published within [24/48] hours.
Posted at: [timestamp]
Email notification templates
Incident notification (to subscribers)
Subject: [Service Name] — [Component] experiencing issues
Hi,
We're currently experiencing [brief description] affecting [Component].
What's happening: [1-2 sentences about the impact]
What we're doing: Our team is investigating and working on a resolution.
Current status: [Investigating / Identified / Monitoring]
You can follow live updates on our status page: [status page URL]
We'll send another update when the situation changes.
— [Your Company] Team
Resolution notification
Subject: [Resolved] [Service Name] — [Component] issue resolved
Hi,
The issue affecting [Component] has been resolved. Services are back
to normal.
Duration: [X] minutes ([start] to [end])
Impact: [what users experienced]
Resolution: [brief non-technical explanation]
We apologize for any disruption. If you have questions, please contact
our support team at [email/link].
A detailed post-incident report will be available at: [link]
— [Your Company] Team
Internal communication templates
Slack: incident declared
🔴 INCIDENT — [P1/P2] — [Component] [down/degraded]
Impact: [what's broken, who's affected]
Detection: [how it was found — AtomPing alert / customer report / internal]
IC (Incident Commander): @[name]
Status page: [link]
War room: [Slack channel / Zoom link]
@on-call — please acknowledge
Slack: periodic update
🟡 UPDATE — [Component] incident — [HH:MM]
Current status: [Identified / fix in progress]
Root cause: [technical explanation for the team]
Next steps: [what's being done now]
ETA: [estimated resolution time]
Blockers: [if any]
Next update in [15/30] minutes
Post-incident summary template
POST-INCIDENT REVIEW — [Date] — [Component]
Duration: [start] to [end] ([X] minutes)
Severity: [P1/P2/P3]
Impact: [number of affected users/requests, revenue impact if applicable]
Detection: [how and when — AtomPing alert at HH:MM, customer report, etc.]
Time to detect: [X] minutes
Time to resolve: [X] minutes
TIMELINE:
[HH:MM] — AtomPing alert: [Component] HTTP check failed
[HH:MM] — On-call acknowledged, began investigation
[HH:MM] — Root cause identified: [description]
[HH:MM] — Fix deployed
[HH:MM] — Service confirmed restored
[HH:MM] — Status page updated to Resolved
ROOT CAUSE:
[2-3 paragraphs explaining what went wrong technically]
WHAT WENT WELL:
- [Detection was fast — 30s via AtomPing]
- [Runbook existed and was followed]
- [Status page kept customers informed]
WHAT COULD BE IMPROVED:
- [Detection: monitor X endpoint was missing]
- [Response: escalation was delayed by Y minutes]
- [Communication: first status page update was late]
ACTION ITEMS:
- [ ] [Specific fix to prevent recurrence] — Owner: @name — Due: [date]
- [ ] [Monitoring improvement] — Owner: @name — Due: [date]
- [ ] [Process improvement] — Owner: @name — Due: [date]
Rules of good incident communication
Speed beats completeness: a first message within 5 minutes that says "investigating" is better than a detailed message after 30 minutes of silence.
Specifics beat vagueness: "API requests returning 503 errors" is better than "some users may experience issues".
Say what is NOT affected: "Dashboard and API are unaffected" reduces panic. Customers assume the worst case until you tell them otherwise.
Give an ETA, with a caveat: "We expect resolution within 30 minutes" is better than "working on it". But add "we will update if this changes".
Don't assign blame: "A configuration change caused..." rather than "An engineer accidentally...". Blameless culture starts with public communication.
Automation with AtomPing
Auto-detection: monitoring detects the outage and automatically opens an incident on the status page. Zero delay between detection and the first update.
Auto-resolution: monitoring confirms recovery and the incident is automatically resolved. The status page updates itself.
Manual details: automation posts "investigating"; an engineer then adds the details: cause, ETA, action items. The best of both worlds: automated speed plus human context.
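The auto-detection and auto-resolution flow can be sketched as a small webhook handler. The payload field names and the status page endpoint below are assumptions for illustration, not AtomPing's actual schema; substitute your status page provider's real API:

```python
import json
import urllib.request

STATUS_API = "https://status.example.com/api/incidents"  # hypothetical endpoint

def incident_update_for(alert: dict) -> dict:
    """Map a monitoring webhook payload to a status page update body.

    `alert["check_name"]` and `alert["state"]` are assumed field names.
    """
    component = alert["check_name"]
    if alert["state"] == "down":
        return {
            "component": component,
            "status": "investigating",  # automation never guesses a cause
            "message": f"We are investigating issues affecting {component}.",
        }
    # Recovery confirmed by the monitor itself.
    return {
        "component": component,
        "status": "resolved",
        "message": f"The issue affecting {component} has been resolved.",
    }

def post_update(body: dict) -> None:
    """Send the update to the status page API (add auth and retries in practice)."""
    req = urllib.request.Request(
        STATUS_API,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

Note that the automated path only ever posts "investigating" or "resolved"; the Identified and Monitoring stages still come from the engineer, who edits the incident to add the cause and ETA.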
Related reading
Incident Management Guide: the full lifecycle from detection to post-mortem
The Complete Guide to Status Pages: components, design, infrastructure
On-Call Best Practices: escalation policies and alert routing
Public vs Internal Status Pages: what to show each audience
FAQ
How quickly should I post the first status page update?
Within 5 minutes of detection. The first update doesn't need root cause — just acknowledge the problem: what's affected, what you know so far, and that you're investigating. Silence during an outage is worse than incomplete information.
How often should I update the status page during an incident?
Every 15-30 minutes for active incidents. Even if nothing changed, post 'Still investigating, no new information' — it shows you're working on it. For P1 incidents with high visibility, update every 10-15 minutes.
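This cadence is easy to encode as the guard for an update-reminder bot. The thresholds below simply mirror the answer above and should be tuned to your own policy:

```python
# Maximum minutes between status page posts, per severity
# (mirrors the cadence described above; adjust to your policy).
CADENCE_MINUTES = {"P1": 15, "P2": 30, "P3": 30}

def update_overdue(severity: str, minutes_since_last_post: float) -> bool:
    """True when the incident commander should post a new update now,
    even if it is only 'Still investigating, no new information'."""
    return minutes_since_last_post >= CADENCE_MINUTES.get(severity, 30)
```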
What tone should I use in incident communications?
Professional, empathetic, and direct. Acknowledge impact ('We understand this affects your workflow'), be specific about what's broken, avoid blame ('our provider' vs 'a third-party issue'), and provide clear next steps. Never use corporate jargon like 'synergies' or 'leveraging' in crisis communication.
Should I explain the technical root cause to customers?
Keep it high-level on the public status page: 'A database issue is causing delays in order processing.' Save technical details (OOMKilled, connection pool exhaustion, replication lag) for the internal status page and post-mortem. Customers care about impact and resolution, not implementation details.
What's the difference between 'investigating' and 'identified'?
Investigating: you know something is wrong but don't know why. Identified: you found the root cause and are working on a fix. Monitoring: fix deployed, watching for confirmation. Resolved: confirmed fixed, service back to normal. These four stages give customers a clear progression.
Should I send a post-incident summary to customers?
Yes, for any incident lasting more than 15 minutes or affecting a significant portion of users. Include: what happened, timeline, root cause (simplified), what you did to fix it, and what you're doing to prevent recurrence. This builds trust and shows accountability.