Overview
At Syntech Innovation, I led backend development for a production LPG delivery and payments platform serving 10,000+ active users. The core challenge was building a payment processing layer that could handle webhooks from multiple payment providers (Paystack and Flutterwave) without ever losing or duplicating a transaction.
Problem
The existing system had several critical reliability gaps:
- Duplicate processing: Webhook retries from payment providers occasionally resulted in double-crediting user wallets
- Lost events: Network failures between the payment provider and our servers meant some successful payments were never recorded
- No observability: When payment discrepancies occurred, there was no audit trail to diagnose root causes
- Provider coupling: Business logic was tightly coupled to Paystack's webhook format, making Flutterwave integration painful
The platform was processing ~2,000 transactions per day, and even a 1% failure rate meant 20 angry customers daily.
Architecture
The pipeline was designed with five key stages, each with its own failure handling strategy.
Data Flow
1. Webhook Ingestion
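Every incoming webhook is acknowledged with a 200 OK as soon as its signature is verified and the event is on the queue; no business logic runs in the handler. Enqueueing before the ACK means a queue failure surfaces as an error response, so the provider retries instead of the event being silently dropped. The fast response keeps payment providers from timing out and retrying prematurely.
```typescript
import { Post, Req, Res } from '@nestjs/common';
import { Request, Response } from 'express';

// Method on the webhook controller class.
@Post('webhook/paystack')
async handlePaystack(@Req() req: Request, @Res() res: Response) {
  const signature = req.headers['x-paystack-signature'];
  // Reject anything that is unsigned or fails verification.
  if (typeof signature !== 'string' || !this.verifyPaystackSignature(req.body, signature)) {
    return res.status(401).send();
  }
  // Enqueue first: if this throws, no 200 is sent and the provider retries.
  // The heavy work (wallet update, ledger write) runs later in a queue worker.
  await this.eventQueue.add('payment', req.body);
  return res.status(200).send('OK');
}
```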
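The handler calls a `verifyPaystackSignature` helper that this write-up does not show. The following is a minimal sketch of what such a check could look like, assuming the `x-paystack-signature` header carries the hex-encoded HMAC-SHA512 of the request body keyed with the account's secret key. The standalone three-argument signature and the use of the raw body string are illustrative assumptions, not the production code.

```typescript
import { createHmac, timingSafeEqual } from 'crypto';

// Hypothetical standalone helper; the secret key would come from configuration.
function verifyPaystackSignature(
  rawBody: string,
  signature: string | undefined,
  secretKey: string,
): boolean {
  if (!signature) return false;
  // Recompute the HMAC over the exact bytes that were received.
  const expected = createHmac('sha512', secretKey).update(rawBody).digest('hex');
  const a = Buffer.from(expected, 'utf8');
  const b = Buffer.from(signature, 'utf8');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Hashing the raw body rather than a re-serialized object avoids mismatches caused by key ordering or whitespace, and the constant-time comparison avoids leaking how many leading characters matched. How the raw body is captured depends on the framework setup.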
2. Event Normalization
A provider-agnostic event envelope ensures the downstream processor never knows (or cares) whether the payment came from Paystack or Flutterwave.
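To make the envelope idea concrete, here is a minimal sketch of a normalized event type with one mapper per provider. All field names, the payload subsets, and the unit handling (Paystack amounts treated as minor units, Flutterwave amounts as major units) are assumptions for illustration; the real envelope is not shown in this write-up.

```typescript
// Provider-agnostic envelope; every field name here is an illustrative assumption.
type Provider = 'paystack' | 'flutterwave';

interface PaymentEvent {
  provider: Provider;
  providerReference: string;
  amountMinor: number; // smallest currency unit, e.g. kobo
  currency: string;
  status: 'success' | 'failed';
}

// Assumed subset of a Paystack payload (amount already in minor units).
interface PaystackPayload {
  event: string;
  data: { reference: string; amount: number; currency: string; status: string };
}

// Assumed subset of a Flutterwave payload (amount in major units).
interface FlutterwavePayload {
  event: string;
  data: { tx_ref: string; amount: number; currency: string; status: string };
}

function fromPaystack(p: PaystackPayload): PaymentEvent {
  return {
    provider: 'paystack',
    providerReference: p.data.reference,
    amountMinor: p.data.amount,
    currency: p.data.currency,
    status: p.data.status === 'success' ? 'success' : 'failed',
  };
}

function fromFlutterwave(p: FlutterwavePayload): PaymentEvent {
  return {
    provider: 'flutterwave',
    providerReference: p.data.tx_ref,
    // Convert major units to minor units; round to absorb float error.
    amountMinor: Math.round(p.data.amount * 100),
    currency: p.data.currency,
    status: p.data.status === 'successful' ? 'success' : 'failed',
  };
}
```

With this shape, supporting another provider means writing one more mapper; the processor only ever sees `PaymentEvent`.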
3. Idempotency Layer
Every event is keyed by provider + provider_reference. Before processing, we check Redis for the key:
- Key exists: Return cached result, skip processing
- Key missing: Acquire distributed lock, process, cache result with 72h TTL
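The check-lock-process-cache sequence above can be sketched as follows. This is a simplified illustration, not the production code: `KeyValueStore` is a hypothetical interface whose `setIfAbsent` stands in for Redis `SET key value NX PX ttl`, `MemoryStore` is an in-memory stand-in for local experiments, and the 30-second lock lifetime is an assumed value.

```typescript
// Hypothetical minimal store interface; a Redis client would back it in production.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  // Resolves true only if the key was absent (Redis: SET key value NX PX ttl).
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>;
  set(key: string, value: string, ttlMs: number): Promise<void>;
  del(key: string): Promise<void>;
}

// In-memory stand-in so the sketch runs without Redis.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, { value: string; expiresAt: number }>();
  async get(key: string): Promise<string | null> {
    const entry = this.data.get(key);
    return entry && entry.expiresAt > Date.now() ? entry.value : null;
  }
  async setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean> {
    if ((await this.get(key)) !== null) return false;
    await this.set(key, value, ttlMs);
    return true;
  }
  async set(key: string, value: string, ttlMs: number): Promise<void> {
    this.data.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
  async del(key: string): Promise<void> {
    this.data.delete(key);
  }
}

const RESULT_TTL_MS = 72 * 60 * 60 * 1000; // 72h result cache
const LOCK_TTL_MS = 30_000; // assumed lock lifetime

async function processOnce(
  store: KeyValueStore,
  provider: string,
  providerReference: string,
  handler: () => Promise<string>,
): Promise<{ result: string; cached: boolean }> {
  const key = `idem:${provider}:${providerReference}`;
  const cached = await store.get(key);
  if (cached !== null) return { result: cached, cached: true };

  const lockKey = `${key}:lock`;
  if (!(await store.setIfAbsent(lockKey, '1', LOCK_TTL_MS))) {
    throw new Error('event is already being processed; retry later');
  }
  try {
    // Re-check under the lock: another worker may have just finished.
    const again = await store.get(key);
    if (again !== null) return { result: again, cached: true };
    const result = await handler();
    await store.set(key, result, RESULT_TTL_MS);
    return { result, cached: false };
  } finally {
    // Simplification: a production lock should only be released by its owner.
    await store.del(lockKey);
  }
}
```

The second lookup inside the lock closes the gap between the first check and lock acquisition, which is where two concurrent retries of the same event would otherwise both proceed.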
4. Transaction Processing
Within a PostgreSQL transaction: update wallet balance, record the ledger entry, and mark the event as processed. If any step fails, the entire transaction rolls back.
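A minimal sketch of that all-or-nothing unit of work is shown below. The `SqlClient` interface is a hypothetical abstraction that mirrors the `query(text, values)` call shape of common PostgreSQL clients, and the table and column names (`wallets`, `ledger_entries`, `payment_events`) are invented for illustration.

```typescript
// Hypothetical minimal client shape: query(text, values) returning a promise.
interface SqlClient {
  query(text: string, values?: unknown[]): Promise<unknown>;
}

interface Credit {
  eventId: string;
  walletId: string;
  amountMinor: number;
}

// Table and column names are hypothetical.
async function applyCredit(db: SqlClient, c: Credit): Promise<void> {
  await db.query('BEGIN');
  try {
    await db.query(
      'UPDATE wallets SET balance = balance + $1 WHERE id = $2',
      [c.amountMinor, c.walletId],
    );
    await db.query(
      'INSERT INTO ledger_entries (wallet_id, event_id, amount) VALUES ($1, $2, $3)',
      [c.walletId, c.eventId, c.amountMinor],
    );
    await db.query(
      "UPDATE payment_events SET status = 'processed' WHERE id = $1",
      [c.eventId],
    );
    await db.query('COMMIT');
  } catch (err) {
    // Any failure undoes the balance change, ledger entry, and status update together.
    await db.query('ROLLBACK');
    throw err;
  }
}
```

All statements must run on the same database connection for the transaction to hold, so with a connection pool the client would be checked out once and released after the commit or rollback.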
5. Dead Letter Queue
Events that fail 3 times land in a DLQ. An alerting system notifies the on-call engineer with full event context.
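The routing decision can be sketched as a small wrapper around the event handler. This is an illustrative outline only: the `Sinks` interface, the `QueuedEvent` shape, and the function name are hypothetical, and in production the sinks would be the queue, the DLQ, and the alerting client.

```typescript
interface QueuedEvent {
  id: string;
  payload: unknown;
  attempts: number; // failures recorded so far
}

// Hypothetical side-effect hooks for the queue, the DLQ, and alerting.
interface Sinks {
  requeue(event: QueuedEvent): void;
  deadLetter(event: QueuedEvent, error: string): void;
  alert(message: string): void;
}

const MAX_ATTEMPTS = 3;

async function handleWithDlq(
  event: QueuedEvent,
  handler: (payload: unknown) => Promise<void>,
  sinks: Sinks,
): Promise<'ok' | 'retry' | 'dead'> {
  try {
    await handler(event.payload);
    return 'ok';
  } catch (err) {
    const attempts = event.attempts + 1;
    const message = err instanceof Error ? err.message : String(err);
    if (attempts >= MAX_ATTEMPTS) {
      // Third failure: park the event with its error and notify on-call.
      sinks.deadLetter({ ...event, attempts }, message);
      sinks.alert(`event ${event.id} dead-lettered after ${attempts} attempts: ${message}`);
      return 'dead';
    }
    sinks.requeue({ ...event, attempts });
    return 'retry';
  }
}
```

Carrying the error message and attempt count along with the original payload is what gives the on-call engineer the full event context when the alert fires.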
API Design
The webhook endpoints follow a consistent pattern:
| Endpoint | Method | Auth | Purpose |
|----------|--------|------|---------|
| /webhook/paystack | POST | HMAC signature | Paystack event ingestion |
| /webhook/flutterwave | POST | Secret hash | Flutterwave event ingestion |
| /admin/webhooks | GET | JWT + RBAC | Event audit log |
| /admin/webhooks/:id/retry | POST | JWT + RBAC | Manual retry from DLQ |
Scaling Strategy
- Horizontal scaling: Stateless webhook receivers behind a load balancer — add more pods under load
- Redis-backed queue: Decouples ingestion from processing; sustained throughput of 500 events/sec per worker
- Connection pooling: PgBouncer manages PostgreSQL connections, preventing exhaustion during traffic spikes
- Rate limiting: Per-IP rate limiting on webhook endpoints to throttle floods of forged or replayed requests (replays that do get through are neutralized by the idempotency layer)
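As an illustration of the per-IP rate limiting item, here is a minimal fixed-window limiter. It is a sketch under stated assumptions: the class name, the limit, and the window size are hypothetical, and the state lives in process memory, whereas a multi-pod deployment would need a shared store such as Redis for the counters.

```typescript
// Hypothetical fixed-window limiter keyed by client IP.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed; `now` is injectable for testing.
  allow(ip: string, now: number = Date.now()): boolean {
    const current = this.windows.get(ip);
    if (!current || now - current.start >= this.windowMs) {
      // First request from this IP, or the previous window has elapsed.
      this.windows.set(ip, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// Usage: at most 2 requests per IP per second.
const limiter = new FixedWindowLimiter(2, 1000);
const decisions = [0, 1, 2].map((t) => limiter.allow('203.0.113.7', t));
console.log(decisions); // [ true, true, false ]
```

A fixed window is the simplest variant; it permits short bursts at window boundaries, which a sliding window or token bucket would smooth out.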
Reliability
- 99.9% uptime maintained across 6 months of production operation
- Zero duplicate payments after deploying the idempotency layer
- < 500ms p99 latency from webhook receipt to wallet credit
- Automatic retry with exponential backoff (1s, 4s, 16s) before DLQ routing
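The retry schedule in the last item can be computed rather than hard-coded. The sketch below assumes the delays follow a base of 1 second multiplied by 4 on each retry, which is inferred from the listed values (1s, 4s, 16s); the function name and defaults are illustrative.

```typescript
// Delay before the n-th retry (1-based): base * factor^(n - 1).
function backoffDelayMs(retry: number, baseMs = 1000, factor = 4): number {
  if (!Number.isInteger(retry) || retry < 1) {
    throw new RangeError('retry must be a positive integer');
  }
  return baseMs * Math.pow(factor, retry - 1);
}

const schedule = [1, 2, 3].map((n) => backoffDelayMs(n));
console.log(schedule); // [ 1000, 4000, 16000 ]
```

Adding random jitter to each delay is a common refinement that keeps many failed events from retrying at the same instant.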
Security
- Signature verification on every incoming webhook (HMAC signature for Paystack, secret hash for Flutterwave)
- IP allowlisting for known Paystack/Flutterwave egress ranges
- Encrypted at rest: All payment data encrypted in PostgreSQL using AES-256
- Audit logging: Every state mutation is logged with actor, timestamp, and before/after values
- JWT authentication with RBAC on admin endpoints, with permissions scoped per role
Trade-offs
| Decision | Benefit | Cost |
|----------|---------|------|
| Async processing via queue | Higher throughput, decoupled | Slightly higher latency (~200ms) |
| 72h idempotency TTL | Catches late retries | Redis memory usage (~50MB) |
| Normalized event format | Provider-agnostic logic | Extra mapping layer to maintain |
| PostgreSQL for ledger | ACID guarantees | Vertical scaling ceiling |
Lessons Learned
- ACK first, process later: Never do heavy work in the webhook handler itself. Payment providers will time out and retry, creating the exact duplicates you're trying to prevent.
- Idempotency is non-negotiable: In financial systems, "at least once" delivery from providers means your own processing must be idempotent so that the net effect is "exactly once".
- Normalize early: Adding Flutterwave took 2 days instead of 2 weeks because the processor was already provider-agnostic.
- DLQs need dashboards: A dead letter queue nobody monitors is just a graveyard. We built an admin panel that made failed events visible and retryable.
- Test with chaos: We regularly killed Redis pods and database connections in staging to verify graceful degradation. Every failure mode should have a known recovery path.