At what scale does a webhook-driven SaaS architecture collapse and how should it be redesigned before that happens? We have 8 backend engineers on AWS. Current volume is 50 events/sec growing 3x/year. Considering Kafka, AWS EventBridge, or SQS+SNS. Need to handle 10x current load within 6 months.
Confidence: 72% · Risk: unknown · 625s
Decision
- Migrate to Apache Kafka via Confluent Cloud to handle the 500 events/sec target within 6 months. At 50 events/sec, direct webhook fan-out works. At 500 events/sec, webhook retry storms, endpoint timeouts, and cascading failures collapse the architecture: the dispatcher becomes the bottleneck. Kafka's partitioned commit log absorbs bursts durably, decouples producers from consumers, and scales horizontally via consumer groups. Use Confluent Cloud (not self-managed Kafka) to keep the operational burden manageable for 8 engineers. Estimated cost: ~$12,000/year.
- Key failure mode: operational complexity can still overwhelm the team. Partition management, consumer group rebalancing, and offset tracking require a dedicated learning investment. Mitigate by starting with a small number of partitions (8) and expanding only when throughput demands it. Second failure mode: under-provisioning the cluster leads to latency spikes under burst loads. Size for 2x the 500 events/sec target (1,000 events/sec peak capacity) to absorb growth and burst traffic.
- Retain webhooks as the delivery mechanism to downstream consumers, but buffer through Kafka, converting synchronous fan-out into asynchronous consumer pulls.
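The "size for 2x" guidance can be sanity-checked with back-of-envelope arithmetic. The per-partition throughput figure below is an assumption for illustration (a conservative planning number, not a Confluent quote); the point is that the 8 partitions in the next-actions plan leave margin beyond the 2x headroom target.

```python
import math

def partitions_needed(peak_events_per_sec: float,
                      per_partition_events_per_sec: float = 250.0,
                      headroom: float = 2.0) -> int:
    """Partitions required to absorb peak load with the given headroom factor."""
    return math.ceil(peak_events_per_sec * headroom / per_partition_events_per_sec)

# 500 events/sec target, sized for 2x burst capacity (1,000 events/sec):
print(partitions_needed(500))  # -> 4 under these assumptions; the planned 8 doubles the margin
```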
Next actions
Deploy Confluent Cloud Basic cluster in existing AWS region, create primary event topic with 8 partitions and 7-day retention
Build shadow-mode producer that mirrors 10% of live webhook traffic to Kafka topic; measure P50/P95/P99 end-to-end latency over 72 hours
Audit all downstream webhook consumers for hard latency SLA requirements — document which endpoints require sub-100ms delivery vs. which tolerate seconds of delay
Build consumer service that reads from Kafka and delivers webhooks with retry/DLQ logic, replacing direct fan-out
Set up alerting on consumer lag, partition rebalance events, and producer error rates in Confluent Cloud dashboard
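The shadow-mode producer's 10% slice is worth making deterministic so the same events are mirrored on every run, which keeps the 72-hour latency comparison stable. A minimal sketch, assuming events carry a stable string id (the id scheme here is hypothetical); hashing the id rather than random sampling gives a reproducible bucket:

```python
import hashlib

def in_shadow_sample(event_id: str, percent: int = 10) -> bool:
    """Deterministically select ~percent% of event ids for mirroring to Kafka."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Map the hash into 100 buckets; ids landing in buckets [0, percent) are mirrored.
    return int.from_bytes(digest[:2], "big") % 100 < percent

# Rough check that the slice is near 10% and stable across calls:
ids = [f"evt-{i}" for i in range(10_000)]
sampled = sum(in_shadow_sample(i) for i in ids)
assert all(in_shadow_sample(i) == in_shadow_sample(i) for i in ids)
```

The actual produce call (e.g. via the confluent-kafka client) would sit behind the `if in_shadow_sample(...)` branch in the live dispatcher.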
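The retry/DLQ logic in the consumer service can be sketched independently of Kafka. This assumes a `send` callable that raises on delivery failure; the backoff values and attempt cap are illustrative defaults, not tuned numbers from the report:

```python
import time

def deliver_with_retry(send, payload, dead_letter,
                       max_attempts: int = 5, base_delay: float = 0.5) -> bool:
    """Try send(payload) with exponential backoff; dead-letter on exhaustion."""
    for attempt in range(max_attempts):
        try:
            send(payload)
            return True
        except Exception as exc:
            if attempt == max_attempts - 1:
                dead_letter(payload, exc)  # hand off to a DLQ topic/queue
                return False
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return False

# Happy path: a sender that succeeds on the first attempt.
sent = []
deliver_with_retry(sent.append, b"evt-1", lambda p, e: None)  # appends b"evt-1"
```

Because the consumer pulls from Kafka at its own pace, a slow or failing endpoint delays only its own partition's progress instead of back-pressuring the producer, which is the core win over direct fan-out.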
This verdict stops being true when
If downstream webhook consumers have hard sub-50ms delivery SLAs that cannot be renegotiated, and the PoC validates that Kafka adds >50ms latency → Use SQS+SNS for async fan-out (lower latency, simpler ops) or keep direct webhooks with horizontal scaling of dispatcher nodes behind a load balancer
If growth rate slows and 500 events/sec target is revised down to <200 events/sec, and team lacks Kafka expertise → Use AWS EventBridge as serverless event router — sufficient throughput at that scale with near-zero operational overhead for 8 engineers
If the team cannot staff 2+ engineers for Kafka migration without halting feature development → Implement SQS+SNS fan-out as a simpler intermediate step that buys 12-18 months of headroom before needing Kafka
Council notes
- Vulcan
- Socrates
- Loki
Assumptions
- Current webhook architecture is synchronous fan-out where the producer directly calls each subscriber endpoint, creating O(subscribers × events) HTTP calls
- The 8-engineer team can allocate 2-3 engineers to the Kafka migration while maintaining feature velocity with the remaining 5-6
- AWS infrastructure and networking costs are not the binding constraint — the ~$12k/year Confluent Cloud cost is within budget
- The 3x annual growth rate continues, meaning the system must handle ~1,500 events/sec within 2 years and ~4,500 events/sec within 3 years
- Downstream webhook consumers can tolerate 5-50ms additional latency introduced by Kafka buffering
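The growth figures in the last assumption are one and two triplings past the 6-month 500 events/sec target:

```python
target = 500  # events/sec at month 6
print([target * 3 ** t for t in (1, 2)])  # -> [1500, 4500]
```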
Operational signals to watch
reversal — If downstream webhook consumers have hard sub-50ms delivery SLAs that cannot be renegotiated, and the PoC validates that Kafka adds >50ms latency
reversal — If growth rate slows and 500 events/sec target is revised down to <200 events/sec, and team lacks Kafka expertise
reversal — If the team cannot staff 2+ engineers for Kafka migration without halting feature development
Unresolved uncertainty
- Whether the 8-engineer team has sufficient Kafka expertise to execute migration in 6 months — Confluent Cloud reduces but does not eliminate the learning curve
- b003's latency concern is valid: Kafka buffering may add 5-50ms latency depending on configuration, and it's unclear whether downstream webhook consumers have hard sub-100ms SLA requirements that would be violated
- b004 and b005 were empty branches at 0.50 confidence — unclear what positions they would have represented, leaving potential alternatives unexplored
- The $12,000/year cost estimate is synthetic — actual Confluent Cloud pricing depends on throughput, retention, and connector usage that weren't specified
- Whether EventBridge might actually be sufficient at this scale (500 events/sec is well within its limits) with lower operational burden — this alternative was not thoroughly stress-tested in debate
Branch battle map
Battle timeline (3 rounds)
Round 1 — Initial positions · 1 branch
Branch b001 (Vulcan) eliminated — outperformed by rival branch
Round 2 — Adversarial probes · 3 branches
- Loki proposed branch b003: "Kafka's durability and throughput come at the cost of added latency (millisecond…"
- Socrates proposed branch b004
Round 3 — Final convergence · 4 branches
- Socrates proposed branch b005
- Loki proposed branch b006: "Webhook scaling is akin to managing a fleet of taxis dispatched directly by phon…"