Webhooks are the lifeblood of modern integrations. When a user pays through Stripe, places an order through a delivery API, or signs in via OAuth, the event reaches your application as a webhook. Your application doesn't poll the server; the server notifies you about events.
The problem: webhooks are fire-and-forget. The sender pushes data to your endpoint and forgets about it. If your endpoint crashes, returns an error, or processes too slowly — the sender never knows. Failures are completely silent until you discover that payments weren't updated, order statuses are stuck, or users didn't authorize at all. By then, hours are lost.
Common Webhook Failure Scenarios
Scenario 1: Endpoint Down
Your webhook handler crashes during deployment. The server tries to deliver the webhook, gets Connection refused, retries, then gives up. The payment completed on Stripe's side, but your database never updates. Stripe's recommendation: always verify payment status directly from Stripe. The webhook is an extra guarantee, not the only source of truth.
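The "verify directly from Stripe" advice can be sketched as a reconciliation rule. This is a minimal sketch: `fetch` of the remote state is assumed to have happened already (e.g. via `stripe.PaymentIntent.retrieve()`), and the function only decides what to do when local and remote state disagree; all names here are illustrative, not from any library.

```python
# Hedged sketch: reconcile local payment state with the provider's state.
# Assumes the remote status was already fetched (e.g. from Stripe's API).

def reconcile(local_status: str, remote_status: str) -> str:
    """Return the action to take given local vs provider state."""
    if local_status == remote_status:
        return "in_sync"
    if remote_status == "succeeded" and local_status != "succeeded":
        return "apply_remote"   # webhook was missed; trust the provider
    return "investigate"        # unexpected divergence; flag for review

print(reconcile("pending", "succeeded"))  # apply_remote
```

Run a reconciliation job like this on a schedule; it catches every payment whose webhook was lost while the endpoint was down.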
Scenario 2: Endpoint Slow
Your webhook handler performs a heavy operation (large database query, calls to three external APIs) and responds in 45 seconds. Stripe timeout is 30 seconds. Stripe times out, thinks it's a failure, and retries. Meanwhile, your endpoint is still processing the first webhook. When the second arrives, you process the payment twice (if you lack idempotency).
Scenario 3: Malformed Payload
Stripe changes the payload schema. A new required field is added or an old one is renamed. Your parser breaks: KeyError: "customer_id". You return 500. Stripe sees 500, retries after 5 minutes, gets 500 again. After several attempts, Stripe gives up and stops sending webhooks with that payload. The failure is silent, only caught in logs if you read them.
Scenario 4: Processing Error
Webhook received and parsed successfully, but the database update fails with unique constraint violation. You return 500. Stripe retries. But the problem isn't the network — it's a logical error. Retrying won't help. The webhook will error again and again. Meanwhile, your backend worker queue fills up, and other webhooks start failing due to resource exhaustion.
Architecture of a Reliable Webhook Handler
Step 1: Fast Acknowledgment (HTTP 202)
# Django / DRF
@api_view(['POST'])
def stripe_webhook(request):
    payload = request.body
    # Verify signature (Stripe requires this; rejects forged requests)
    sig_header = request.META.get('HTTP_STRIPE_SIGNATURE')
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, STRIPE_ENDPOINT_SECRET
        )
    except (ValueError, stripe.error.SignatureVerificationError):
        return Response(
            {"error": "Invalid signature"},
            status=400
        )
    # Store the raw event in the DB immediately,
    # before any processing that might fail
    WebhookEvent.objects.create(
        event_id=event['id'],
        event_type=event['type'],
        payload=event,
        status='pending'
    )
    # Queue async processing
    process_webhook_event.delay(event['id'])
    # Return 202 immediately:
    # tell Stripe the event is received and will be processed async
    return Response({"status": "received"}, status=202)
Key point: return 200 (or 202) before any processing that might fail. Save the webhook to the database, queue it for processing, and respond immediately. Stripe sees 202 and considers delivery successful. Processing happens asynchronously, and if it fails, it doesn't break webhook delivery.
Step 2: Idempotent Processing
# Background worker
def process_webhook_event(event_id):
    event = WebhookEvent.objects.get(event_id=event_id)
    # Detect duplicates: Stripe re-sends the same webhook
    # if we reply slowly, so check event_id before doing any work
    if ProcessedWebhook.objects.filter(
        webhook_id=event.event_id
    ).exists():
        print(f"Webhook {event.event_id} already processed, skipping")
        return
    try:
        if event.event_type == 'payment_intent.succeeded':
            handle_payment_success(event.payload)
        elif event.event_type == 'payment_intent.payment_failed':
            handle_payment_failure(event.payload)
        elif event.event_type == 'customer.subscription.updated':
            handle_subscription_update(event.payload)
        # Mark as processed ONLY after success
        ProcessedWebhook.objects.create(webhook_id=event.event_id)
        event.status = 'processed'
        event.save()
    except Exception as e:
        # Log the error; don't mark the event 'processed'
        event.status = 'failed'
        event.error = str(e)
        event.save()
        # Alert the team
        send_alert(f"Webhook processing failed: {event.event_id}")
        # Re-queue with a delay (grow it per attempt for exponential backoff)
        process_webhook_event.apply_async(
            args=[event_id],
            countdown=300  # retry after 5 minutes
        )
Idempotency means processing a webhook N times produces the same result as processing it once. Use webhook_id as a unique identifier. If a webhook arrives twice, process it only the first time.
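The `filter(...).exists()` check above has a race window: two workers can both see "not processed" and both proceed. A database UNIQUE constraint closes it, because only one INSERT can win. This is a minimal sketch using SQLite in place of the article's Django models; the table name and schema are illustrative.

```python
import sqlite3

# Minimal sketch of race-free duplicate detection backed by a UNIQUE
# constraint, so two concurrent workers can't both claim one webhook.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_webhook (webhook_id TEXT PRIMARY KEY)")

def claim(webhook_id: str) -> bool:
    """Return True if this call wins the right to process the webhook."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO processed_webhook (webhook_id) VALUES (?)",
                (webhook_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # someone already claimed it; skip

print(claim("evt_123"))  # True  - first delivery, process it
print(claim("evt_123"))  # False - duplicate delivery, skip
```

In Django the same effect comes from a `unique=True` field plus catching `IntegrityError`, or from `get_or_create()`.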
Step 3: Dead Letter Queue for Persistent Failures
# DLQ model
class WebhookDeadLetter(models.Model):
    webhook_id = models.CharField(max_length=255, unique=True)
    event_type = models.CharField(max_length=100)
    payload = models.JSONField()
    error_message = models.TextField()
    retry_count = models.IntegerField(default=0)
    last_retry_at = models.DateTimeField(null=True)
    created_at = models.DateTimeField(auto_now_add=True)

# In the exception handler:
if event.retry_count >= 3:  # failed 3+ times
    WebhookDeadLetter.objects.create(
        webhook_id=event.event_id,
        event_type=event.event_type,
        payload=event.payload,
        error_message=str(e),
        retry_count=event.retry_count
    )
    # Send alert to ops team
    alert_dlq(f"Webhook {event.event_id} moved to DLQ")
A Dead Letter Queue is a place for webhooks that cannot be processed (logical errors, not temporary failures). Webhooks should not end up there due to crashed endpoints or network errors — only due to malformed payloads or logical contradictions. The DLQ becomes a dashboard where your team sees unresolved events and can handle them manually.
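Manual handling usually ends in a replay: after the team fixes the bug, dead letters are run through the handler again. A minimal sketch, with plain dicts and a `handler` callable standing in for the Django models and real processing logic:

```python
# Sketch of a manual DLQ replay helper (illustrative names, not the
# article's models): re-run the handler and remove an entry from the
# DLQ only if processing now succeeds.
def replay_dead_letters(dlq, handler):
    """Try each dead letter once; return ids that are now resolved."""
    resolved = []
    for entry in list(dlq):          # iterate over a copy
        try:
            handler(entry["payload"])
        except Exception:
            continue                 # still failing; leave it in the DLQ
        dlq.remove(entry)
        resolved.append(entry["webhook_id"])
    return resolved

dlq = [{"webhook_id": "evt_1", "payload": {"ok": True}},
       {"webhook_id": "evt_2", "payload": None}]
handler = lambda p: p["ok"]          # raises on the None payload
resolved = replay_dead_letters(dlq, handler)
print(resolved)  # ['evt_1']
```

Entries that still fail stay in the DLQ for the next pass, so the dashboard keeps showing only unresolved events.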
Monitoring Webhook Delivery with AtomPing
Pattern 1: Heartbeat Monitoring
Have the sender (or a scheduled job you control) push heartbeat events to your endpoint every 60 seconds. These aren't real payments, just pings. Your application logs each heartbeat. External monitoring (AtomPing) reads the log and checks: is the latest heartbeat no older than 120 seconds?
AtomPing Configuration:
Type: Heartbeat
URL: https://api.yourapp.com/health/last-webhook-heartbeat
Check interval: 30 seconds
Expected interval: 120 seconds (webhook heartbeat should arrive every 60s, 2x tolerance)
Endpoint should return the timestamp of the last heartbeat
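The staleness rule behind that endpoint is small enough to sketch directly. This assumes the 60-second heartbeat interval and 2x tolerance from the configuration above; timestamps are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Sketch of the heartbeat staleness check: healthy if the last
# heartbeat is no older than 60 s (expected interval) x 2 (tolerance).
MAX_AGE = timedelta(seconds=120)

def heartbeat_healthy(last_seen: datetime, now: datetime) -> bool:
    return (now - last_seen) <= MAX_AGE

now = datetime(2026, 3, 26, 10, 30, tzinfo=timezone.utc)
print(heartbeat_healthy(now - timedelta(seconds=90), now))   # True
print(heartbeat_healthy(now - timedelta(seconds=300), now))  # False
```

The 2x tolerance matters: with a tight 60-second threshold, a single delayed ping would page you for nothing.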
Pattern 2: HTTP Endpoint Monitoring
Create a GET /health/webhooks endpoint that returns processing status:
GET /health/webhooks
{
"status": "healthy",
"last_webhook_received_at": "2026-03-26T10:28:00Z",
"pending_webhooks": 0,
"failed_webhooks": 0,
"dlq_count": 0,
"worker_alive": true,
"metrics": {
"processed_last_hour": 42,
"errors_last_hour": 1,
"avg_processing_time_ms": 234
}
}
Then configure an HTTP check in AtomPing with JSON path assertions:
Assertions:
✓ $.status equals healthy
✓ $.pending_webhooks equals 0
✓ $.worker_alive equals true
✓ $.dlq_count lt 5 (alert if more than 5 in DLQ)
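The same assertions reduce to plain dictionary lookups; this sketch mirrors them in Python (it is not AtomPing's own assertion engine, and the function name is illustrative):

```python
# Sketch of the four health assertions above as plain dict checks.
def check_health(body: dict) -> list:
    failures = []
    if body.get("status") != "healthy":
        failures.append("status != healthy")
    if body.get("pending_webhooks") != 0:
        failures.append("pending_webhooks != 0")
    if not body.get("worker_alive"):
        failures.append("worker not alive")
    if body.get("dlq_count", 0) >= 5:
        failures.append("dlq_count >= 5")
    return failures

healthy = {"status": "healthy", "pending_webhooks": 0,
           "worker_alive": True, "dlq_count": 0}
print(check_health(healthy))  # []
```

Returning a list of failed assertions (rather than a single boolean) makes the alert message tell you exactly which check broke.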
Pattern 3: Self-test via Webhook
Periodically send test webhooks from your application to itself. This allows you to:
✓ Test full path: handler → queue → worker → database
✓ Measure end-to-end latency
✓ Verify idempotency works (test webhook arrives twice)
# Celery Beat task (runs every hour)
@periodic_task(run_every=crontab(minute=0))
def send_test_webhook():
    webhook_event = {
        'id': f"test_{uuid4().hex}",  # unique id the worker will record
        'type': 'test_webhook',
        'timestamp': now().isoformat(),
        'data': {'marker': 'internal_test'}
    }
    # POST to your own webhook endpoint
    response = requests.post(
        f"{INTERNAL_API_URL}/webhooks/stripe",
        json=webhook_event,
        timeout=10
    )
    # Give the async worker a moment, then check processing
    time.sleep(5)
    processed = ProcessedWebhook.objects.filter(
        webhook_id=webhook_event['id']
    ).exists()
    if not processed or response.status_code != 202:
        alert("Test webhook failed",
              f"Status: {response.status_code}, Processed: {processed}")
Retry Strategies: Sender vs Recipient
Sender (Stripe) Retries
Stripe retries webhooks if it gets a non-2xx response. Exponential backoff:
Attempt 1: immediately
Attempt 2: 5 minutes later
Attempt 3: 30 minutes later
Attempt 4: 2 hours later
Attempt 5: 5 hours later
Max: 3 days, then Stripe gives up
Stripe's retry policy is good for network problems (endpoint was down, network glitch), but useless for logical errors (malformed payload, processing bug). If your database returns a constraint violation, retrying won't help: the violation will happen on every attempt.
Recipient (You) Retries
You shouldn't retry webhooks at the HTTP response level. Instead:
1. Received webhook → saved to DB → returned 202
2. Queued processing to background worker
3. Worker attempts to process
4. If fails → exponential backoff in queue (5 min → 30 min → 2 hours)
5. After N failures → Dead Letter Queue
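Steps 4 and 5 boil down to one decision per failure: requeue with a delay, or give up and dead-letter. A minimal sketch, where the 3-attempt cutoff and the specific delays are illustrative:

```python
# Sketch of the retry-or-dead-letter decision for a failed webhook.
DELAYS = [300, 1800, 7200]  # 5 min, 30 min, 2 hours

def next_action(retry_count: int):
    """Return ('retry', delay_seconds) or ('dead_letter', None)."""
    if retry_count < len(DELAYS):
        return ("retry", DELAYS[retry_count])
    return ("dead_letter", None)

print(next_action(0))  # ('retry', 300)
print(next_action(3))  # ('dead_letter', None)
```

In a Celery worker the delay would feed `apply_async(countdown=...)`, and the dead-letter branch would create the `WebhookDeadLetter` row and alert the team.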
Logging and Audit Trail
Webhooks are sensitive data. Payments, subscriptions, authorization. You need a complete audit trail:
class WebhookAuditLog(models.Model):
webhook_id = models.CharField(max_length=255)
received_at = models.DateTimeField(auto_now_add=True)
processing_started_at = models.DateTimeField(null=True)
processing_completed_at = models.DateTimeField(null=True)
status = models.CharField( # received, processing, success, failed
max_length=20,
choices=[...])
http_status = models.IntegerField()
payload_hash = models.CharField(max_length=64) # SHA256
response_time_ms = models.IntegerField()
error_message = models.TextField(null=True)
retry_count = models.IntegerField(default=0)
class Meta:
indexes = [
models.Index(fields=['webhook_id']),
models.Index(fields=['status']),
models.Index(fields=['received_at']),
]
Use payload_hash instead of the full payload in audit logs (saves space and limits how much sensitive data you retain). Store full payloads in a separate table.
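Computing the hash is a one-liner once the payload is serialized canonically, so the same event always produces the same digest. A sketch, assuming JSON payloads:

```python
import hashlib
import json

# Sketch of computing payload_hash for the audit log: serialize with
# sorted keys so key order never changes the digest, then SHA-256.
def payload_hash(payload: dict) -> str:
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = payload_hash({"id": "evt_1", "amount": 500})
b = payload_hash({"amount": 500, "id": "evt_1"})  # key order differs
print(a == b)  # True - canonical serialization makes hashing stable
print(len(a))  # 64  - hex SHA-256 fits CharField(max_length=64)
```

Hashing the raw request body instead would also work, but then two deliveries of the same event with reordered keys would no longer match.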
Integration with Test Webhooks
Most webhook providers (Stripe, GitHub, SendGrid) provide a "Send test webhook" button in their dashboard. Use this button to verify before production launch. Ensure:
✓ Your endpoint is reachable (DNS resolves, firewall open)
✓ Signature verification works
✓ Payload parses correctly
✓ Webhook is logged and visible in your logs
SSL Certificate and Webhook Delivery
Webhook providers deliver to HTTPS endpoints. If your SSL certificate expires, delivery fails: Stripe attempts the connection, gets a TLS handshake failure, retries on its usual schedule, and eventually gives up. Webhook delivery silently stops.
Solution: monitor SSL certificate expiry. AtomPing provides TLS checks for this. Set alerts 30 days before expiry.
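For a local sanity check alongside the external one, the expiry date can be read from the certificate and compared against the 30-day threshold. A sketch: the `notAfter` string below uses the format Python's `ssl.getpeercert()` returns, and the dates are illustrative.

```python
from datetime import datetime, timezone

# Sketch of a local certificate-expiry check; an external TLS check
# (e.g. AtomPing's) does the same from outside your network.
def days_until_expiry(not_after: str, now: datetime) -> int:
    """Parse a notAfter string like 'Apr 30 12:00:00 2026 GMT'."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).days

now = datetime(2026, 3, 26, tzinfo=timezone.utc)
print(days_until_expiry("Apr 30 12:00:00 2026 GMT", now))  # 35
```

Alert when the result drops below 30; by the time the handshake starts failing, webhooks are already being lost.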
Testing Webhooks Locally
Testing Tools
ngrok — expose your local server to the internet. Use for webhook development
Stripe CLI — local Stripe webhook emulator. stripe listen --forward-to localhost:8000/webhooks
curl — manual testing. Send a raw webhook to yourself, verify handling
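For curl-based manual testing against a handler that verifies signatures, you need to produce a valid Stripe-style signature header yourself. A hedged sketch: Stripe's documented scheme signs the string `"<timestamp>.<raw body>"` with the endpoint secret using HMAC-SHA256, but verify the details against Stripe's current docs; the secret and timestamp below are made up.

```python
import hashlib
import hmac

# Sketch of building a Stripe-style signature header for manual tests.
# Stripe signs "<timestamp>.<raw body>" with the endpoint secret.
def stripe_signature(secret: str, timestamp: int, body: bytes) -> str:
    signed_payload = f"{timestamp}.".encode() + body
    v1 = hmac.new(secret.encode(), signed_payload,
                  hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={v1}"

sig = stripe_signature("whsec_test", 1711450080, b'{"type":"test"}')
print(sig.startswith("t=1711450080,v1="))  # True
```

Pass the result as the `Stripe-Signature` header in your curl request; the raw body bytes you sign must match the bytes you send exactly.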
Load Testing Webhooks
Before production launch, run a load test. Fire 1000 webhooks in 1 second and verify: endpoint doesn't crash, queue doesn't overflow, workers process without errors.
# Load testing with Apache Bench
ab -n 1000 -c 50 -p webhook.json \
   -T "application/json" \
   https://api.yourapp.com/webhooks/stripe
# Results:
# Requests per second: 95.2
# Avg response time: 52ms
# 95th percentile: 120ms
# Success: 100%
Checklist: Reliable Webhook Handler
Architecture: HTTP 202 immediately, async processing, separate queue
Idempotency: webhook_id as unique identifier, detect duplicates
Error handling: log everything, use DLQ for persistent failures, alert team
Signature verification: verify sender (Stripe, GitHub, etc.)
Monitoring: heartbeat pattern, health endpoint with metrics, test yourself with webhooks
SSL: monitor certificate expiry, alert 30 days before
Logging: full audit trail, payload_hash, retry_count, processing_time
Related Resources
Heartbeat Monitoring — pattern for monitoring periodic tasks
API Monitoring Guide — how to monitor REST endpoints
Health Check Endpoint Design — JSON path assertions for /health
SSL Certificate Monitoring — monitor expiry
Reduce False Alarms — quorum confirmation for webhooks
Heartbeat Check — in AtomPing
FAQ
What is webhook monitoring?
Webhook monitoring is the practice of tracking whether webhooks are being delivered successfully to your endpoint. Instead of your application pulling data from a source, webhooks push notifications to your endpoint when events occur. Monitoring ensures every event is delivered, processed, and handled within expected response times. Without it, silent failures go undetected until users notice missing data.
Why do webhooks fail silently?
Webhooks are fire-and-forget by design. The sender pushes data to your endpoint and moves on — they don't constantly check if you received it. Failures are silent: your endpoint might be down, returning 500 errors, or taking 60+ seconds to respond, but the sender never knows. That's why you need both internal logging (did we receive it?) and external monitoring (does the sender keep delivering?).
How do you detect webhook delivery failures?
Two approaches: 1) Internal monitoring — log every webhook received, parse the log, and alert if expected webhooks don't arrive within time window (heartbeat pattern). 2) External monitoring — create a heartbeat endpoint (dummy webhook listener) that the sender pushes to, and monitor that endpoint stays healthy. This catches both delivery failures and endpoint crashes.
What causes webhook delivery failures?
Your endpoint is down or unreachable (DNS, firewall, TLS cert expired). Your endpoint returns non-200 status or takes too long (>30s) to respond. Your endpoint crashes mid-processing (unhandled exception). Payload structure changed (missing field, wrong format) causing parsing error. Network path is broken (routing issue, ISP blocks sender IP). Sender's retry logic is broken or disabled (gives up too soon).
What is the heartbeat pattern for webhook monitoring?
Sender continuously pushes heartbeat events to your endpoint (every 30-60 seconds) separate from real events. Your internal monitor watches the heartbeat log. If no heartbeat arrives within expected window, the sender's webhook delivery is broken. This separates 'sender can't reach us' (heartbeat fails) from 'we received it but processing is slow' (heartbeat arrives). AtomPing's Heartbeat check type monitors this pattern.
How should you handle webhook processing failures?
Separate receiving (HTTP 202 Accepted + queue) from processing (async worker). Store the raw webhook in the database immediately and return 202 before any processing. Implement idempotency (process by webhook ID, detect duplicates). If processing later fails, don't surface a 500 to the sender; that would only trigger sender retries. Instead, log the error, add the event to a dead letter queue, and alert your team. Retry from your own queue with exponential backoff rather than relying on sender retries.