What problem does this system address?
Standard webhook implementations silently lose events when consumers are unavailable, slow, or returning errors. A production-grade webhook system must guarantee at-least-once delivery while protecting both the sender’s infrastructure and the consumer’s endpoint from overload.
I built this system for a payments platform that needed to notify 340 merchant integrations of transaction events. The previous implementation used fire-and-forget HTTP POST requests. If the consumer was unavailable, the event was lost. Merchants discovered missing events days later when their records diverged from the platform’s records. The trust cost was significant: 23% of merchants cited “unreliable notifications” as their primary integration complaint.
How is the system structured?
The system uses 5 components: a durable event queue, idempotency-keyed delivery, exponential backoff with jitter, a dead letter queue with manual replay, and consumer health scoring.
Component 1: Durable Event Queue
Every webhook event is persisted to a durable message queue (Kafka with replication factor 3) before any delivery attempt. The event remains in the queue until delivery is confirmed by the consumer (HTTP 2xx response) or the event exhausts its retry budget and moves to the dead letter queue. Events are retained for 72 hours regardless of delivery status, enabling replay if a consumer discovers missing events after the fact.
Component 2: Idempotency Keys
Every event carries a unique idempotency key (UUID v4) in both the payload and a custom header. Consumers are documented to use this key for deduplication. At-least-once delivery means some events will be delivered more than once (during retries after ambiguous failures). The idempotency key allows consumers to safely discard duplicates. In production, approximately 0.3% of deliveries are duplicates, all of which are harmless when consumers implement idempotency.
Component 3: Exponential Backoff with Jitter
Failed deliveries are retried with exponential backoff: 10 seconds, 30 seconds, 90 seconds, 270 seconds, up to a maximum of 1 hour between attempts, with random jitter of plus or minus 20% to prevent thundering herd effects when multiple consumers recover simultaneously. The retry budget is 15 attempts over 24 hours. This provides consumer endpoints with enough recovery time to handle transient outages, planned maintenance, and scaling events without losing data.
Component 4: Dead Letter Queue
Events that exhaust their retry budget are moved to a dead letter queue with the failure reason (last HTTP status code, timeout, connection refused). A monitoring dashboard shows dead letter queue depth per consumer. The operations team can replay dead letter events individually or in bulk after the consumer’s issue is resolved. In 12 months, the dead letter queue captured 2,100 events (0.0025% of total), all of which were eventually delivered after consumer remediation.
Component 5: Consumer Health Scoring
Each consumer endpoint has a health score based on its success rate over the last 24 hours. Endpoints with success rates below 10% are automatically suspended from delivery attempts to prevent the webhook system from wasting resources on consistently failing endpoints. Suspended consumers are probed with a health check every 5 minutes. When the probe succeeds, normal delivery resumes and queued events are delivered in order. This pattern protects the webhook infrastructure from cascading resource exhaustion when multiple consumers fail simultaneously.
How do you validate it works?
Validation uses end-to-end delivery confirmation, consumer-side reconciliation reports, and monthly chaos testing that simulates consumer failures at various stages.
Each delivery is confirmed by matching the consumer’s HTTP response against expected success codes. A daily reconciliation report compares events published to events confirmed delivered, flagging any discrepancies. Monthly chaos testing simulates consumer failures (timeout, 500 error, connection refused, DNS failure) to verify that the retry, dead letter, and health scoring mechanisms function correctly. The 99.97% delivery rate is measured over rolling 30-day periods, with the remaining 0.03% being events in active retry at the measurement time (they are eventually delivered). The system draws on principles from event-driven architecture and asynchronous systems.