Webhooks are the lifeblood of modern integrations. When a user pays through Stripe, places an order through a delivery API, or signs in via OAuth, the event reaches your application as a webhook. Your application doesn't poll the server; the server notifies you about events.
The problem: webhooks are fire-and-forget. The sender pushes data to your endpoint and forgets about it. If your endpoint crashes, returns an error, or processes too slowly — the sender never knows. Failures are completely silent until you discover that payments weren't updated, order statuses are stuck, or users didn't authorize at all. By then, hours are lost.
Common Webhook Failure Scenarios
Scenario 1: Endpoint Down
Your webhook handler crashes during deployment. The server tries to deliver the webhook, gets Connection refused, retries, then gives up. The payment completed on Stripe's side, but your database never updates. Stripe's recommendation: always verify payment status directly from Stripe. The webhook is an extra guarantee, not the only source of truth.
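The "verify directly from Stripe" advice can be sketched as a reconciliation rule. This is a minimal sketch: `fetch` of the remote state is assumed to have happened already (e.g. via `stripe.PaymentIntent.retrieve()`), and the function only decides what to do when local and remote state disagree; all names here are illustrative, not from any library.

```python
# Hedged sketch: reconcile local payment state with the provider's state.
# Assumes the remote status was already fetched (e.g. from Stripe's API).

def reconcile(local_status: str, remote_status: str) -> str:
    """Return the action to take given local vs provider state."""
    if local_status == remote_status:
        return "in_sync"
    if remote_status == "succeeded" and local_status != "succeeded":
        return "apply_remote"   # webhook was missed; trust the provider
    return "investigate"        # unexpected divergence; flag for review

print(reconcile("pending", "succeeded"))  # apply_remote
```

Run a reconciliation job like this on a schedule; it catches every payment whose webhook was lost while the endpoint was down.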
Scenario 2: Endpoint Slow
Your webhook handler performs a heavy operation (large database query, calls to three external APIs) and responds in 45 seconds. Stripe timeout is 30 seconds. Stripe times out, thinks it's a failure, and retries. Meanwhile, your endpoint is still processing the first webhook. When the second arrives, you process the payment twice (if you lack idempotency).
Scenario 3: Malformed Payload
Stripe changes the payload schema. A new required field is added or an old one is renamed. Your parser breaks: KeyError: "customer_id". You return 500. Stripe sees 500, retries after 5 minutes, gets 500 again. After several attempts, Stripe gives up and stops sending webhooks with that payload. The failure is silent, only caught in logs if you read them.
Scenario 4: Processing Error
Webhook received and parsed successfully, but the database update fails with unique constraint violation. You return 500. Stripe retries. But the problem isn't the network — it's a logical error. Retrying won't help. The webhook will error again and again. Meanwhile, your backend worker queue fills up, and other webhooks start failing due to resource exhaustion.
Architecture of a Reliable Webhook Handler
Step 1: Fast Acknowledgment (HTTP 202)
# Django / DRF
@api_view(['POST'])
def stripe_webhook(request):
    payload = request.body
    # Verify signature (Stripe requires this; rejects forged requests)
    sig_header = request.META.get('HTTP_STRIPE_SIGNATURE')
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, STRIPE_ENDPOINT_SECRET
        )
    except (ValueError, stripe.error.SignatureVerificationError):
        return Response(
            {"error": "Invalid signature"},
            status=400
        )
    # Store the raw event in the DB immediately,
    # before any processing that might fail
    WebhookEvent.objects.create(
        event_id=event['id'],
        event_type=event['type'],
        payload=event,
        status='pending'
    )
    # Queue async processing
    process_webhook_event.delay(event['id'])
    # Return 202 immediately:
    # tell Stripe the event is received and will be processed async
    return Response({"status": "received"}, status=202)
Key point: return 200 (or 202) before any processing that might fail. Save the webhook to the database, queue it for processing, and respond immediately. Stripe sees 202 and considers delivery successful. Processing happens asynchronously, and if it fails, it doesn't break webhook delivery.
Step 2: Idempotent Processing
# Background worker
def process_webhook_event(event_id):
    event = WebhookEvent.objects.get(event_id=event_id)
    # Detect duplicates: Stripe re-sends the same webhook
    # if we reply slowly, so check event_id before doing any work
    if ProcessedWebhook.objects.filter(
        webhook_id=event.event_id
    ).exists():
        print(f"Webhook {event.event_id} already processed, skipping")
        return
    try:
        if event.event_type == 'payment_intent.succeeded':
            handle_payment_success(event.payload)
        elif event.event_type == 'payment_intent.payment_failed':
            handle_payment_failure(event.payload)
        elif event.event_type == 'customer.subscription.updated':
            handle_subscription_update(event.payload)
        # Mark as processed ONLY after success
        ProcessedWebhook.objects.create(webhook_id=event.event_id)
        event.status = 'processed'
        event.save()
    except Exception as e:
        # Log the error; don't mark the event 'processed'
        event.status = 'failed'
        event.error = str(e)
        event.save()
        # Alert the team
        send_alert(f"Webhook processing failed: {event.event_id}")
        # Re-queue with a delay (grow it per attempt for exponential backoff)
        process_webhook_event.apply_async(
            args=[event_id],
            countdown=300  # retry after 5 minutes
        )
Idempotency means processing a webhook N times produces the same result as processing it once. Use webhook_id as a unique identifier. If a webhook arrives twice, process it only the first time.
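The `filter(...).exists()` check above has a race window: two workers can both see "not processed" and both proceed. A database UNIQUE constraint closes it, because only one INSERT can win. This is a minimal sketch using SQLite in place of the article's Django models; the table name and schema are illustrative.

```python
import sqlite3

# Minimal sketch of race-free duplicate detection backed by a UNIQUE
# constraint, so two concurrent workers can't both claim one webhook.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_webhook (webhook_id TEXT PRIMARY KEY)")

def claim(webhook_id: str) -> bool:
    """Return True if this call wins the right to process the webhook."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO processed_webhook (webhook_id) VALUES (?)",
                (webhook_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # someone already claimed it; skip

print(claim("evt_123"))  # True  - first delivery, process it
print(claim("evt_123"))  # False - duplicate delivery, skip
```

In Django the same effect comes from a `unique=True` field plus catching `IntegrityError`, or from `get_or_create()`.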
Step 3: Dead Letter Queue for Persistent Failures
# DLQ model
class WebhookDeadLetter(models.Model):
    webhook_id = models.CharField(max_length=255, unique=True)
    event_type = models.CharField(max_length=100)
    payload = models.JSONField()
    error_message = models.TextField()
    retry_count = models.IntegerField(default=0)
    last_retry_at = models.DateTimeField(null=True)
    created_at = models.DateTimeField(auto_now_add=True)

# In the exception handler:
if event.retry_count >= 3:  # failed 3+ times
    WebhookDeadLetter.objects.create(
        webhook_id=event.event_id,
        event_type=event.event_type,
        payload=event.payload,
        error_message=str(e),
        retry_count=event.retry_count
    )
    # Send alert to ops team
    alert_dlq(f"Webhook {event.event_id} moved to DLQ")
A Dead Letter Queue is a place for webhooks that cannot be processed (logical errors, not temporary failures). Webhooks should not end up there due to crashed endpoints or network errors — only due to malformed payloads or logical contradictions. The DLQ becomes a dashboard where your team sees unresolved events and can handle them manually.
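Manual handling usually ends in a replay: after the team fixes the bug, dead letters are run through the handler again. A minimal sketch, with plain dicts and a `handler` callable standing in for the Django models and real processing logic:

```python
# Sketch of a manual DLQ replay helper (illustrative names, not the
# article's models): re-run the handler and remove an entry from the
# DLQ only if processing now succeeds.
def replay_dead_letters(dlq, handler):
    """Try each dead letter once; return ids that are now resolved."""
    resolved = []
    for entry in list(dlq):          # iterate over a copy
        try:
            handler(entry["payload"])
        except Exception:
            continue                 # still failing; leave it in the DLQ
        dlq.remove(entry)
        resolved.append(entry["webhook_id"])
    return resolved

dlq = [{"webhook_id": "evt_1", "payload": {"ok": True}},
       {"webhook_id": "evt_2", "payload": None}]
handler = lambda p: p["ok"]          # raises on the None payload
resolved = replay_dead_letters(dlq, handler)
print(resolved)  # ['evt_1']
```

Entries that still fail stay in the DLQ for the next pass, so the dashboard keeps showing only unresolved events.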
Monitoring Webhook Delivery with AtomPing
Pattern 1: Heartbeat Monitoring
Have the sender (or a scheduled job you control) push heartbeat events to your endpoint every 60 seconds. These aren't real payments, just pings. Your application logs each heartbeat. External monitoring (AtomPing) reads the log and checks: is the latest heartbeat no older than 120 seconds?
AtomPing Configuration:
Type: Heartbeat
URL: https://api.yourapp.com/health/last-webhook-heartbeat
Check interval: 30 seconds
Expected interval: 120 seconds (webhook heartbeat should arrive every 60s, 2x tolerance)
Endpoint should return the timestamp of the last heartbeat
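The staleness rule behind that endpoint is small enough to sketch directly. This assumes the 60-second heartbeat interval and 2x tolerance from the configuration above; timestamps are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Sketch of the heartbeat staleness check: healthy if the last
# heartbeat is no older than 60 s (expected interval) x 2 (tolerance).
MAX_AGE = timedelta(seconds=120)

def heartbeat_healthy(last_seen: datetime, now: datetime) -> bool:
    return (now - last_seen) <= MAX_AGE

now = datetime(2026, 3, 26, 10, 30, tzinfo=timezone.utc)
print(heartbeat_healthy(now - timedelta(seconds=90), now))   # True
print(heartbeat_healthy(now - timedelta(seconds=300), now))  # False
```

The 2x tolerance matters: with a tight 60-second threshold, a single delayed ping would page you for nothing.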
Pattern 2: HTTP Endpoint Monitoring
Create a GET /health/webhooks endpoint that returns processing status:
GET /health/webhooks
{
"status": "healthy",
"last_webhook_received_at": "2026-03-26T10:28:00Z",
"pending_webhooks": 0,
"failed_webhooks": 0,
"dlq_count": 0,
"worker_alive": true,
"metrics": {
"processed_last_hour": 42,
"errors_last_hour": 1,
"avg_processing_time_ms": 234
}
}
Then configure an HTTP check in AtomPing with JSON path assertions:
Assertions:
✓ $.status equals healthy
✓ $.pending_webhooks equals 0
✓ $.worker_alive equals true
✓ $.dlq_count lt 5 (alert if more than 5 in DLQ)
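The same assertions reduce to plain dictionary lookups; this sketch mirrors them in Python (it is not AtomPing's own assertion engine, and the function name is illustrative):

```python
# Sketch of the four health assertions above as plain dict checks.
def check_health(body: dict) -> list:
    failures = []
    if body.get("status") != "healthy":
        failures.append("status != healthy")
    if body.get("pending_webhooks") != 0:
        failures.append("pending_webhooks != 0")
    if not body.get("worker_alive"):
        failures.append("worker not alive")
    if body.get("dlq_count", 0) >= 5:
        failures.append("dlq_count >= 5")
    return failures

healthy = {"status": "healthy", "pending_webhooks": 0,
           "worker_alive": True, "dlq_count": 0}
print(check_health(healthy))  # []
```

Returning a list of failed assertions (rather than a single boolean) makes the alert message tell you exactly which check broke.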
Pattern 3: Self-test via Webhook
Periodically send test webhooks from your application to itself. This allows you to:
✓ Test full path: handler → queue → worker → database
✓ Measure end-to-end latency
✓ Verify idempotency works (test webhook arrives twice)
# Celery Beat task (runs every hour)
@periodic_task(run_every=crontab(minute=0))
def send_test_webhook():
    webhook_event = {
        'id': f"test_{uuid4().hex}",  # unique id the worker will record
        'type': 'test_webhook',
        'timestamp': now().isoformat(),
        'data': {'marker': 'internal_test'}
    }
    # POST to your own webhook endpoint
    response = requests.post(
        f"{INTERNAL_API_URL}/webhooks/stripe",
        json=webhook_event,
        timeout=10
    )
    # Give the async worker a moment, then check processing
    time.sleep(5)
    processed = ProcessedWebhook.objects.filter(
        webhook_id=webhook_event['id']
    ).exists()
    if not processed or response.status_code != 202:
        alert("Test webhook failed",
              f"Status: {response.status_code}, Processed: {processed}")
Retry Strategies: Sender vs Recipient
Sender (Stripe) Retries
Stripe retries webhooks if it gets a non-2xx response. Exponential backoff:
Attempt 1: immediately
Attempt 2: 5 minutes later
Attempt 3: 30 minutes later
Attempt 4: 2 hours later
Attempt 5: 5 hours later
Max: 3 days, then Stripe gives up
Stripe's retry policy is good for network problems (endpoint was down, network glitch), but useless for logical errors (malformed payload, processing bug). If your database returns a constraint violation, retrying won't help: the violation will happen on every attempt.
Recipient (You) Retries
You shouldn't retry webhooks at the HTTP response level. Instead:
1. Received webhook → saved to DB → returned 202
2. Queued processing to background worker
3. Worker attempts to process
4. If fails → exponential backoff in queue (5 min → 30 min → 2 hours)
5. After N failures → Dead Letter Queue
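Steps 4 and 5 boil down to one decision per failure: requeue with a delay, or give up and dead-letter. A minimal sketch, where the 3-attempt cutoff and the specific delays are illustrative:

```python
# Sketch of the retry-or-dead-letter decision for a failed webhook.
DELAYS = [300, 1800, 7200]  # 5 min, 30 min, 2 hours

def next_action(retry_count: int):
    """Return ('retry', delay_seconds) or ('dead_letter', None)."""
    if retry_count < len(DELAYS):
        return ("retry", DELAYS[retry_count])
    return ("dead_letter", None)

print(next_action(0))  # ('retry', 300)
print(next_action(3))  # ('dead_letter', None)
```

In a Celery worker the delay would feed `apply_async(countdown=...)`, and the dead-letter branch would create the `WebhookDeadLetter` row and alert the team.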
Logging and Audit Trail
Webhooks are sensitive data. Payments, subscriptions, authorization. You need a complete audit trail:
class WebhookAuditLog(models.Model):
webhook_id = models.CharField(max_length=255)
received_at = models.DateTimeField(auto_now_add=True)
processing_started_at = models.DateTimeField(null=True)
processing_completed_at = models.DateTimeField(null=True)
status = models.CharField( # received, processing, success, failed
max_length=20,
choices=[...])
http_status = models.IntegerField()
payload_hash = models.CharField(max_length=64) # SHA256
response_time_ms = models.IntegerField()
error_message = models.TextField(null=True)
retry_count = models.IntegerField(default=0)
class Meta:
indexes = [
models.Index(fields=['webhook_id']),
models.Index(fields=['status']),
models.Index(fields=['received_at']),
]
Use payload_hash instead of the full payload in audit logs (saves space and limits how much sensitive data you retain). Store full payloads in a separate table.
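Computing the hash is a one-liner once the payload is serialized canonically, so the same event always produces the same digest. A sketch, assuming JSON payloads:

```python
import hashlib
import json

# Sketch of computing payload_hash for the audit log: serialize with
# sorted keys so key order never changes the digest, then SHA-256.
def payload_hash(payload: dict) -> str:
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = payload_hash({"id": "evt_1", "amount": 500})
b = payload_hash({"amount": 500, "id": "evt_1"})  # key order differs
print(a == b)  # True - canonical serialization makes hashing stable
print(len(a))  # 64  - hex SHA-256 fits CharField(max_length=64)
```

Hashing the raw request body instead would also work, but then two deliveries of the same event with reordered keys would no longer match.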
Integration with Test Webhooks
Most webhook providers (Stripe, GitHub, SendGrid) provide a "Send test webhook" button in their dashboard. Use this button to verify before production launch. Ensure:
✓ Your endpoint is reachable (DNS resolves, firewall open)
✓ Signature verification works
✓ Payload parses correctly
✓ Webhook is logged and visible in your logs
SSL Certificate and Webhook Delivery
Webhook providers deliver to HTTPS endpoints. If your SSL certificate expires, delivery fails: Stripe attempts the connection, gets a TLS handshake failure, retries on its usual schedule, and eventually gives up. Webhook delivery silently stops.
Solution: monitor SSL certificate expiry. AtomPing provides TLS checks for this. Set alerts 30 days before expiry.
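For a local sanity check alongside the external one, the expiry date can be read from the certificate and compared against the 30-day threshold. A sketch: the `notAfter` string below uses the format Python's `ssl.getpeercert()` returns, and the dates are illustrative.

```python
from datetime import datetime, timezone

# Sketch of a local certificate-expiry check; an external TLS check
# (e.g. AtomPing's) does the same from outside your network.
def days_until_expiry(not_after: str, now: datetime) -> int:
    """Parse a notAfter string like 'Apr 30 12:00:00 2026 GMT'."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).days

now = datetime(2026, 3, 26, tzinfo=timezone.utc)
print(days_until_expiry("Apr 30 12:00:00 2026 GMT", now))  # 35
```

Alert when the result drops below 30; by the time the handshake starts failing, webhooks are already being lost.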
Testing Webhooks Locally
Testing Tools
ngrok — expose your local server to the internet. Use for webhook development
Stripe CLI — local Stripe webhook emulator. stripe listen --forward-to localhost:8000/webhooks
curl — manual testing. Send a raw webhook to yourself, verify handling
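For curl-based manual testing against a handler that verifies signatures, you need to produce a valid Stripe-style signature header yourself. A hedged sketch: Stripe's documented scheme signs the string `"<timestamp>.<raw body>"` with the endpoint secret using HMAC-SHA256, but verify the details against Stripe's current docs; the secret and timestamp below are made up.

```python
import hashlib
import hmac

# Sketch of building a Stripe-style signature header for manual tests.
# Stripe signs "<timestamp>.<raw body>" with the endpoint secret.
def stripe_signature(secret: str, timestamp: int, body: bytes) -> str:
    signed_payload = f"{timestamp}.".encode() + body
    v1 = hmac.new(secret.encode(), signed_payload,
                  hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={v1}"

sig = stripe_signature("whsec_test", 1711450080, b'{"type":"test"}')
print(sig.startswith("t=1711450080,v1="))  # True
```

Pass the result as the `Stripe-Signature` header in your curl request; the raw body bytes you sign must match the bytes you send exactly.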
Load Testing Webhooks
Before production launch, run a load test. Fire 1000 webhooks in 1 second and verify: endpoint doesn't crash, queue doesn't overflow, workers process without errors.
# Load testing with Apache Bench
ab -n 1000 -c 50 -p webhook.json \
   -T "application/json" \
   https://api.yourapp.com/webhooks/stripe
# Results:
# Requests per second: 95.2
# Avg response time: 52ms
# 95th percentile: 120ms
# Success: 100%
Checklist: Reliable Webhook Handler
Architecture: HTTP 202 immediately, async processing, separate queue
Idempotency: webhook_id as unique identifier, detect duplicates
Error handling: log everything, use DLQ for persistent failures, alert team
Signature verification: verify sender (Stripe, GitHub, etc.)
Monitoring: heartbeat pattern, health endpoint with metrics, test yourself with webhooks
SSL: monitor certificate expiry, alert 30 days before
Logging: full audit trail, payload_hash, retry_count, processing_time
Related Resources
Heartbeat Monitoring — pattern for monitoring periodic tasks
API Monitoring Guide — how to monitor REST endpoints
Health Check Endpoint Design — JSON path assertions for /health
SSL Certificate Monitoring — monitor expiry
Reduce False Alarms — quorum confirmation for webhooks
Heartbeat Check — in AtomPing
FAQ
What is webhook monitoring?
Webhook monitoring is the practice of tracking whether webhooks are being delivered successfully to your endpoint. Instead of your application pulling data from a source, webhooks push notifications to your endpoint when events occur. Monitoring ensures every event is delivered, processed, and handled within expected response times. Without it, silent failures go undetected until users notice missing data.
Why do webhooks fail silently?
Webhooks are fire-and-forget by design. The sender pushes data to your endpoint and moves on — they don't constantly check if you received it. Failures are silent: your endpoint might be down, returning 500 errors, or taking 60+ seconds to respond, but the sender never knows. That's why you need both internal logging (did we receive it?) and external monitoring (does the sender keep delivering?).
How do you detect webhook delivery failures?
Two approaches: 1) Internal monitoring — log every webhook received, parse the log, and alert if expected webhooks don't arrive within time window (heartbeat pattern). 2) External monitoring — create a heartbeat endpoint (dummy webhook listener) that the sender pushes to, and monitor that endpoint stays healthy. This catches both delivery failures and endpoint crashes.
What causes webhook delivery failures?
Your endpoint is down or unreachable (DNS, firewall, TLS cert expired). Your endpoint returns non-200 status or takes too long (>30s) to respond. Your endpoint crashes mid-processing (unhandled exception). Payload structure changed (missing field, wrong format) causing parsing error. Network path is broken (routing issue, ISP blocks sender IP). Sender's retry logic is broken or disabled (gives up too soon).
What is the heartbeat pattern for webhook monitoring?
Sender continuously pushes heartbeat events to your endpoint (every 30-60 seconds) separate from real events. Your internal monitor watches the heartbeat log. If no heartbeat arrives within expected window, the sender's webhook delivery is broken. This separates 'sender can't reach us' (heartbeat fails) from 'we received it but processing is slow' (heartbeat arrives). AtomPing's Heartbeat check type monitors this pattern.
How should you handle webhook processing failures?
Separate receiving (HTTP 202 Accepted + queue) from processing (async worker). Store the raw webhook in the database immediately and return 202 before any processing. Implement idempotency (process by webhook ID, detect duplicates). If processing later fails, don't surface a 500 to the sender; that would only trigger sender retries. Instead, log the error, add the event to a dead letter queue, and alert your team. Retry from your own queue with exponential backoff rather than relying on sender retries.