Webhook Monitoring: Ensure Your Webhooks Never Fail

Complete guide to webhook monitoring: detecting delivery failures, implementing heartbeat patterns, idempotency, dead letter queues, and integrating with AtomPing HTTP checks.

2026-03-26 · 10 min · Technical Guide

Webhooks are the bread and butter of integration in modern applications. When a user pays through Stripe, orders a delivery through an API, or signs in through OAuth, the event often reaches you as a webhook. Your application does not poll the server; the server tells you when something happens.

The problem: from the receiver's side, a webhook is fire-and-forget. The sender pushes data to your endpoint and moves on. If your endpoint is down, returns an error, or responds too slowly, nobody tells you. The failure can be completely silent until you discover that payments were not updated, order statuses are stuck, or users could not sign in at all. By then, hours have been lost.

Typical webhook failure scenarios

Scenario 1: endpoint down

Your webhook handler crashed during a deploy. The sender tries to deliver the webhook, gets Connection refused, retries for a while, then gives up. The payment succeeded on Stripe's side, but your database was never updated. Stripe's recommendation: always check the payment status with Stripe directly; the webhook is an additional guarantee, not the single source of truth.

Scenario 2: endpoint slow

Your webhook handler does heavy work (a large database query, calls to three external APIs) and responds in 45 seconds. Assume the Stripe timeout is 30 seconds: Stripe sees a timeout, treats it as a failure, and retries. Meanwhile your endpoint is still processing the first delivery, and when the second one arrives you process the payment twice (unless the handler is idempotent).

Scenario 3: malformed payload

Stripe changed the payload schema: it added a new required field or renamed an old one. Your parser breaks with KeyError: "customer_id". You return 500. Stripe sees the 500, retries 5 minutes later, and gets 500 again. After several attempts Stripe gives up and stops delivering that event. The failure is silent and only shows up in the logs, if anyone reads them.

Scenario 4: processing error

The webhook was received and parsed successfully, but the database update failed with a unique constraint violation. You return 500 and Stripe retries. But the problem is not the network; it is a logic error, so retrying will not help. The webhook fails the same way again and again. Meanwhile the backend worker queue fills up, and other webhooks start failing for lack of resources.

Architecture of a reliable webhook handler

Step 1: fast acknowledgment (HTTP 202)

# Django / DRF
# WebhookEvent, process_webhook_event and STRIPE_ENDPOINT_SECRET
# are defined elsewhere in the project
import stripe
from rest_framework.decorators import api_view
from rest_framework.response import Response

@api_view(['POST'])
def stripe_webhook(request):
    payload = request.body
    
    # Verify the signature (without it, anyone could POST forged events)
    sig_header = request.META.get('HTTP_STRIPE_SIGNATURE')
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, STRIPE_ENDPOINT_SECRET
        )
    except (ValueError, stripe.error.SignatureVerificationError):
        return Response(
            {"error": "Invalid signature"},
            status=400
        )
    
    # Store raw event in DB immediately
    # Before any processing that might fail
    # get_or_create keeps a redelivered event from creating a second row
    WebhookEvent.objects.get_or_create(
        event_id=event['id'],
        defaults={
            'event_type': event['type'],
            'payload': event,
            'status': 'pending',
        }
    )
    
    # Queue async processing
    process_webhook_event.delay(event['id'])
    
    # Return 202 immediately
    # Tell Stripe: event received, will process async
    return Response({"status": "received"}, status=202)

The key point: return 200 (or 202) before any processing that might fail. Save the webhook to the database, queue it for processing, and respond right away. Stripe sees the 2xx response and considers the delivery successful. Processing happens asynchronously, and if it fails, that does not break webhook delivery.

Step 2: idempotent processing

# Background worker (a Celery task, so .delay() and .apply_async() exist)
import logging
from celery import shared_task

logger = logging.getLogger(__name__)

@shared_task
def process_webhook_event(event_id):
    # The HTTP handler passes the provider's event ID, so look up by event_id
    event = WebhookEvent.objects.get(event_id=event_id)
    
    # Detect duplicates: the sender's event ID is the idempotency key.
    # Stripe may deliver the same event more than once if we reply slowly
    if ProcessedWebhook.objects.filter(
        webhook_id=event.event_id
    ).exists():
        logger.info("Webhook %s already processed, skipping", event.event_id)
        return
    
    try:
        if event.event_type == 'payment_intent.succeeded':
            handle_payment_success(event.payload)
        elif event.event_type == 'payment_intent.payment_failed':
            handle_payment_failure(event.payload)
        elif event.event_type == 'customer.subscription.updated':
            handle_subscription_update(event.payload)
        
        # Mark as processed ONLY after success
        ProcessedWebhook.objects.create(webhook_id=event.event_id)
        event.status = 'processed'
        event.save()
    except Exception as e:
        # Record the error; never mark the event as 'processed' here
        event.status = 'failed'
        event.error = str(e)
        event.save()
        
        # Alert the team
        send_alert(f"Webhook processing failed: {event.event_id}")
        
        # Re-queue with a fixed 5-minute delay. For real exponential
        # backoff, grow the delay per attempt and cap the attempt count
        process_webhook_event.apply_async(
            args=[event_id],
            countdown=300  # retry after 5 minutes
        )

Idempotency means that processing the same webhook N times gives the same result as processing it once. Use webhook_id as the unique identifier. If a webhook arrives twice, process it only the first time.
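The idea can be shown without a database. The following is a minimal sketch, assuming an in-memory store; `InMemoryProcessedStore` and `process_once` are hypothetical names used only for illustration, standing in for a table with a unique webhook_id column.

```python
from typing import Any, Callable, Set


class InMemoryProcessedStore:
    """Hypothetical stand-in for a table with a unique webhook_id column."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def claim(self, webhook_id: str) -> bool:
        """Return True only for the first caller that claims this ID."""
        if webhook_id in self._seen:
            return False
        self._seen.add(webhook_id)
        return True

    def release(self, webhook_id: str) -> None:
        """Undo a claim so that a failed event can be retried later."""
        self._seen.discard(webhook_id)


def process_once(
    store: InMemoryProcessedStore,
    webhook_id: str,
    payload: Any,
    handler: Callable[[Any], None],
) -> bool:
    """Run handler at most once per webhook_id; return True if it ran."""
    if not store.claim(webhook_id):
        return False  # duplicate delivery, skip it
    try:
        handler(payload)
    except Exception:
        store.release(webhook_id)  # leave the event retryable
        raise
    return True
```

With a real database, the claim step would typically be an insert under a unique constraint, which also closes the gap between "check if processed" and "mark as processed" when two workers pick up the same event at once.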

Step 3: dead letter queue for persistent failures

# DLQ model
class WebhookDeadLetter(models.Model):
    webhook_id = models.CharField(max_length=255, unique=True)
    event_type = models.CharField(max_length=100)
    payload = models.JSONField()
    error_message = models.TextField()
    retry_count = models.IntegerField(default=0)
    last_retry_at = models.DateTimeField(null=True)
    created_at = models.DateTimeField(auto_now_add=True)

# In exception handler (assumes the event model tracks status_failed_count):
if event.status_failed_count >= 3:  # Failed 3+ times
    WebhookDeadLetter.objects.create(
        webhook_id=event.event_id,
        event_type=event.event_type,
        payload=event.payload,
        error_message=str(e),
        retry_count=event.status_failed_count
    )
    # Send alert to ops team
    alert_dlq(f"Webhook {event.event_id} moved to DLQ")

The dead letter queue is the place for webhooks that cannot be processed because of logic errors rather than temporary failures. Webhooks should not land there because of a crashed endpoint or network errors, only because of a malformed payload or logical contradictions. The DLQ becomes a dashboard where your team sees unresolved events and can work through them manually.
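One way to keep temporary failures out of the DLQ is to classify the exception before deciding what to do with the event. This is a rough sketch under stated assumptions: the choice of `PERMANENT_ERRORS`, the `MAX_ATTEMPTS` limit, and the `route_failure` helper are all hypothetical, and a retry budget that runs out is treated here as a persistent failure.

```python
# Assumption: these exception types signal a problem that retrying cannot fix
# (for example a missing key in a malformed payload).
PERMANENT_ERRORS = (KeyError, ValueError, TypeError)

# Assumption: how many attempts a transient failure gets before it is
# considered persistent.
MAX_ATTEMPTS = 3


def route_failure(exc: Exception, attempts: int) -> str:
    """Decide where a failed event goes next: 'retry' or 'dlq'."""
    if isinstance(exc, PERMANENT_ERRORS):
        return "dlq"  # logic or payload error: retrying repeats the failure
    if attempts >= MAX_ATTEMPTS:
        return "dlq"  # retry budget exhausted, needs a human
    return "retry"  # likely temporary (network, timeout): try again later
```

For example, `route_failure(KeyError("customer_id"), 1)` sends the event straight to the DLQ, while a first `ConnectionError` is retried.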

Monitoring webhook delivery with AtomPing

Pattern 1: Heartbeat monitoring

The sender (Stripe in this example, or any other that supports it) sends you a heartbeat event every 60 seconds. It is not a real payment, just a ping. Your application records each heartbeat, and an external monitor (AtomPing) queries an endpoint that exposes the latest one and checks that the last heartbeat is no older than 120 seconds.

Configuration in AtomPing:

Type: Heartbeat

URL: https://api.yourapp.com/health/last-webhook-heartbeat

Check interval: 30 seconds

Expected interval: 120 seconds (the webhook heartbeat should arrive every 60 s, with 2x tolerance)

The endpoint should return the timestamp of the last heartbeat
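The freshness rule behind this check is simple enough to sketch. The function below is an illustration only: `heartbeat_status` and the shape of the returned dictionary are assumptions, not AtomPing's API, and the 120-second default mirrors the expected interval above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def heartbeat_status(
    last_heartbeat: Optional[datetime],
    now: datetime,
    max_age_seconds: int = 120,
) -> dict:
    """Build a response body for a last-heartbeat endpoint."""
    if last_heartbeat is None:
        # No heartbeat has ever been recorded
        return {"healthy": False, "last_heartbeat": None, "age_seconds": None}
    age = (now - last_heartbeat).total_seconds()
    return {
        "healthy": age <= max_age_seconds,
        "last_heartbeat": last_heartbeat.isoformat(),
        "age_seconds": int(age),
    }


if __name__ == "__main__":
    current = datetime.now(timezone.utc)
    print(heartbeat_status(current - timedelta(seconds=45), current))
```

Passing `now` in explicitly keeps the function easy to test; the view that serves the endpoint would supply the current UTC time and the stored timestamp.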

Pattern 2: HTTP endpoint monitoring

Create a GET /health/webhooks endpoint that returns the processing status:

GET /health/webhooks

{
  "status": "healthy",
  "last_webhook_received_at": "2026-03-26T10:28:00Z",
  "pending_webhooks": 0,
  "failed_webhooks": 0,
  "dlq_count": 0,
  "worker_alive": true,
  "metrics": {
    "processed_last_hour": 42,
    "errors_last_hour": 1,
    "avg_processing_time_ms": 234
  }
}

Then configure an HTTP check in AtomPing with JSON path assertions:

Assertions:

✓ $.status equals healthy

✓ $.pending_webhooks equals 0

✓ $.worker_alive equals true

✓ $.dlq_count lt 5 (alert once the DLQ holds 5 or more events)
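On the application side, the top-level status can be derived from the same fields the assertions inspect. This is a sketch under assumptions: `build_webhook_health`, the "degraded" status value, and the threshold parameter are hypothetical; only the field names come from the example response above.

```python
def build_webhook_health(
    pending: int,
    failed: int,
    dlq_count: int,
    worker_alive: bool,
    dlq_threshold: int = 5,
) -> dict:
    """Assemble the core of a /health/webhooks response from raw counters."""
    healthy = worker_alive and pending == 0 and dlq_count < dlq_threshold
    return {
        # "degraded" is an assumed label for the unhealthy state
        "status": "healthy" if healthy else "degraded",
        "pending_webhooks": pending,
        "failed_webhooks": failed,
        "dlq_count": dlq_count,
        "worker_alive": worker_alive,
    }
```

On a busy system a strict "pending equals 0" rule may flap between checks, so a small threshold for pending events could be a more stable choice.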

Pattern 3: send yourself a test webhook

Periodically send test webhooks from your application to itself. This lets you:

✓ Check the full path: handler → queue → worker → database

✓ Measure end-to-end latency

✓ Confirm that idempotency works (send the test webhook twice)

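# Celery task; schedule it hourly with Celery beat
# (a beat_schedule entry using crontab(minute=0))
import time
import uuid

import requests
from celery import shared_task
from django.utils.timezone import now

@shared_task
def send_test_webhook():
    webhook_event = {
        'id': f"test_{uuid.uuid4().hex}",  # needed for the lookup below
        'type': 'test_webhook',
        'timestamp': now().isoformat(),  # datetime itself is not JSON serializable
        'data': {'marker': 'internal_test'}
    }
    
    # POST to your own webhook endpoint
    # (assumes the endpoint accepts internal test events; if it verifies
    # signatures, sign the body the same way the sender does)
    response = requests.post(
        f"{INTERNAL_API_URL}/webhooks/stripe",
        json=webhook_event,
        timeout=10
    )
    
    # Processing is async, so give the worker a moment before checking
    time.sleep(5)
    processed = ProcessedWebhook.objects.filter(
        webhook_id=webhook_event['id']
    ).exists()
    
    if not processed or response.status_code != 202:
        alert("Test webhook failed", 
              f"Status: {response.status_code}, Processed: {processed}")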

Retry strategies: sender vs receiver

The sender (Stripe) retries

Stripe retries a webhook when it gets a non-2xx response, using exponential backoff. For example:

Attempt 1: immediately

Attempt 2: 5 minutes later

Attempt 3: 30 minutes later

Attempt 4: 2 hours later

Attempt 5: 5 hours later

Max: 3 days, then Stripe gives up

Stripe's retry policy works well for network problems (the endpoint was down, a network glitch) but poorly for logic errors (malformed payload, processing error). If your database returns a constraint violation, retrying will not help: the violation happens every time.

The receiver (you) retries

You should not drive retries through the HTTP response. Instead:

1. Receive the webhook → save it to the database → return 202

2. Queue the processing for a worker

3. The worker tries to process it

4. If it fails → exponential backoff in the queue (5 min → 30 min → 2 hours)

5. After N failures → Dead Letter Queue
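Steps 4 and 5 can be expressed as a small lookup. This is a minimal sketch, assuming the three delays listed above; `RETRY_DELAYS_SECONDS` and `next_retry_delay` are hypothetical names.

```python
from typing import Optional

# Delays from the steps above: 5 min, 30 min, 2 hours
RETRY_DELAYS_SECONDS = [5 * 60, 30 * 60, 2 * 60 * 60]


def next_retry_delay(failures: int) -> Optional[int]:
    """Return the delay before the next attempt, or None to go to the DLQ.

    failures is the number of failed attempts so far (1 after the first
    failure).
    """
    if failures < 1:
        raise ValueError("failures must be at least 1")
    if failures > len(RETRY_DELAYS_SECONDS):
        return None  # retry budget exhausted: move the event to the DLQ
    return RETRY_DELAYS_SECONDS[failures - 1]
```

In a Celery worker the returned value could be passed as `countdown` to `apply_async`, with `None` triggering the move to the dead letter queue.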

Logging and audit trail

Webhooks carry sensitive data: payments, subscriptions, authorization. You need a full audit trail:

class WebhookAuditLog(models.Model):
    webhook_id = models.CharField(max_length=255)
    received_at = models.DateTimeField(auto_now_add=True)
    processing_started_at = models.DateTimeField(null=True)
    processing_completed_at = models.DateTimeField(null=True)
    status = models.CharField(  # received, processing, success, failed
        max_length=20,
        choices=[...])
    http_status = models.IntegerField()
    payload_hash = models.CharField(max_length=64)  # SHA256
    response_time_ms = models.IntegerField()
    error_message = models.TextField(null=True)
    retry_count = models.IntegerField(default=0)
    
    class Meta:
        indexes = [
            models.Index(fields=['webhook_id']),
            models.Index(fields=['status']),
            models.Index(fields=['received_at']),
        ]

Store payload_hash instead of the full payload in the audit log: it saves space and helps with compliance, since sensitive data stays out of the log. Keep the payloads themselves in a separate table.
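Computing the hash takes one call to the standard library. A minimal sketch, assuming the hash is taken over the raw request body exactly as received (`payload_sha256` is a hypothetical helper name):

```python
import hashlib


def payload_sha256(raw_body: bytes) -> str:
    """Return the SHA-256 hex digest (64 characters) of the raw request body."""
    return hashlib.sha256(raw_body).hexdigest()
```

Hashing the raw bytes rather than a re-serialized JSON object avoids differences in key order or whitespace changing the digest for the same delivery.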

Using the provider's test webhook

Most webhook providers (Stripe, GitHub, SendGrid) offer a "Send test webhook" button in their dashboard. Use it to check your setup before going to production. Make sure that:

✓ Your endpoint is reachable (DNS resolves, the firewall is open)

✓ Signature verification works

✓ The payload is parsed correctly

✓ The webhook is logged and visible in your logs
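To exercise the signature check without a provider, it can help to see the mechanism in isolation. The sketch below uses a generic HMAC-SHA256 over the raw body; it is not any specific provider's header format (each provider defines its own scheme), and `sign_payload` and `verify_signature` are hypothetical helpers. For Stripe, the SDK call `stripe.Webhook.construct_event` shown earlier remains the way to verify.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, raw_body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, raw_body: bytes, received: str) -> bool:
    """Check a received hex signature using a constant-time comparison."""
    expected = sign_payload(secret, raw_body)
    return hmac.compare_digest(expected.encode(), received.encode())


if __name__ == "__main__":
    secret = b"example-shared-secret"  # placeholder value
    body = b'{"type": "test_webhook"}'
    signature = sign_payload(secret, body)
    print(verify_signature(secret, body, signature))
```

The comparison goes through `hmac.compare_digest` rather than `==` so that the time taken does not reveal how many leading characters matched.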

SSL certificate and webhook delivery

Webhook providers deliver to an HTTPS endpoint. If your SSL certificate has expired, delivery fails: Stripe tries, hits a TLS handshake failure, and keeps retrying with the same result. Once the retries run out, delivery of that webhook ends.

Solution: monitor SSL certificate expiry. AtomPing provides TLS checks for this. Set up an alert 30 days before expiry.
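For a quick manual check, the expiry date can also be read with Python's standard library. This is a sketch, not a replacement for a monitoring service; `days_until_expiry` and `fetch_not_after` are hypothetical helper names, and the hostname in the usage line is the placeholder domain used elsewhere in this guide.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date.

    not_after uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    remaining = days_until_expiry(fetch_not_after("api.yourapp.com"))
    if remaining < 30:
        print(f"Certificate expires in {remaining:.1f} days")
```

With the default context, a certificate that has already expired makes the handshake itself raise an error, which is a signal worth alerting on too.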

Testing webhooks locally

Tools for testing

ngrok: exposes a local server to the internet. Use it for local webhook development

Stripe CLI: a local Stripe webhook emulator. stripe listen --forward-to localhost:8000/webhooks

curl: manual testing. Send yourself a raw webhook and check how it is handled

Load testing webhooks

Run a load test before going to production. Send a burst of 1000 webhooks (the example below uses 50 concurrent connections) and verify that the endpoint stays up, the queue does not overflow, and the workers process everything without errors.

# Load testing with Apache Bench
# -p sends the file as the POST body, -T sets its Content-Type
ab -n 1000 -c 50 -p webhook.json \
   -T "application/json" \
   https://api.yourapp.com/webhooks/stripe

# Example results:
# Requests per second: 95.2
# Avg response time: 52ms
# 95th percentile: 120ms
# Success: 100%

Checklist: a reliable webhook handler

Архитектура: HTTP 202 immediately, async processing, separate queue

Idempotency: webhook_id as unique identifier, detect duplicates

Error handling: log everything, use DLQ for persistent failures, alert team

Signature verification: verify the sender (Stripe, GitHub, etc.)

Monitoring: heartbeat pattern, health endpoint with metrics, test webhooks sent to yourself

SSL: monitor certificate expiry, alert 30 days ahead

Logging: full audit trail, payload_hash, retry_count, processing_time

Related materials

Heartbeat Monitoring: a pattern for checking periodic jobs

API Monitoring Guide: how to monitor REST endpoints

Health Check Endpoint Design: JSON path assertions for /health

SSL Certificate Monitoring: monitoring expiry

Reduce False Alarms: quorum confirmation for webhooks

Heartbeat Check: in AtomPing

FAQ

What is webhook monitoring?

Webhook monitoring is the practice of tracking whether webhooks are being delivered successfully to your endpoint. Instead of your application pulling data from a source, webhooks push notifications to your endpoint when events occur. Monitoring ensures every event is delivered, processed, and handled within expected response times. Without it, silent failures go undetected until users notice missing data.

Why do webhooks fail silently?

Webhooks are fire-and-forget by design. The sender pushes data to your endpoint and moves on — they don't constantly check if you received it. Failures are silent: your endpoint might be down, returning 500 errors, or taking 60+ seconds to respond, but the sender never knows. That's why you need both internal logging (did we receive it?) and external monitoring (does the sender keep delivering?).

How do you detect webhook delivery failures?

Two approaches: 1) Internal monitoring: log every webhook received, parse the log, and alert if expected webhooks don't arrive within the expected time window (heartbeat pattern). 2) External monitoring: create a heartbeat endpoint (dummy webhook listener) that the sender pushes to, and monitor that the endpoint stays healthy. Together these catch both delivery failures and endpoint crashes.

What causes webhook delivery failures?

Your endpoint is down or unreachable (DNS, firewall, expired TLS certificate). Your endpoint returns a non-2xx status or takes too long (>30s) to respond. Your endpoint crashes mid-processing (unhandled exception). The payload structure changed (missing field, wrong format) and causes a parsing error. The network path is broken (routing issue, ISP blocks the sender's IP). The sender's retry logic is broken or disabled (gives up too soon).

What is the heartbeat pattern for webhook monitoring?

Sender continuously pushes heartbeat events to your endpoint (every 30-60 seconds) separate from real events. Your internal monitor watches the heartbeat log. If no heartbeat arrives within expected window, the sender's webhook delivery is broken. This separates 'sender can't reach us' (heartbeat fails) from 'we received it but processing is slow' (heartbeat arrives). AtomPing's Heartbeat check type monitors this pattern.

How should you handle webhook processing failures?

Separate receiving (HTTP 202 Accepted + queue) from processing (async worker). Store raw webhook in database immediately — return 200 before processing. Implement idempotency (process by webhook ID, detect duplicates). If processing fails, DON'T respond 500 — that triggers sender retry. Instead, log error, add to dead letter queue, and alert your team. Retry from queue using exponential backoff, not sender retries.
