At what scale does a webhook-driven SaaS architecture collapse and how should it be redesigned before that happens? We have 8 backend engineers on AWS. Current volume is 50 events/sec growing 3x/year. Considering Kafka, AWS EventBridge, or SQS+SNS. Need to handle 10x current load within 6 months.

accepted_conditional · Pro · 625s · $0.70
6 branches explored · 5 survived · 3 rounds · integrity 75%
Confidence: 72% · Risk: unknown


Decision

Migrate to Apache Kafka via Confluent Cloud to handle the 500 events/sec target within 6 months. At 50 events/sec, direct webhook fan-out works. At 500 events/sec, webhook retry storms, endpoint timeouts, and cascading failures collapse the architecture: the dispatcher becomes the bottleneck. Kafka's partitioned commit log absorbs bursts durably, decouples producers from consumers, and scales horizontally via consumer groups. Use Confluent Cloud (not self-managed Kafka) to keep the operational burden manageable for 8 engineers. Estimated cost: ~$12,000/year. Key failure mode: operational complexity still risks overwhelming the team; partition management, consumer group rebalancing, and offset tracking require dedicated learning investment. Mitigate by starting with a small number of partitions (8) and expanding only when throughput demands it. Second failure mode: under-provisioning the cluster leads to latency spikes under burst loads. Size for 2x the 500 events/sec target (1,000 events/sec peak capacity) to absorb growth and burst traffic. Retain webhooks as the delivery mechanism to downstream consumers, but buffer through Kafka, converting synchronous fan-out into async consumer pulls.
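The sizing logic above can be sanity-checked with simple arithmetic. A minimal sketch, assuming a hypothetical per-partition consumer drain rate of 150 events/sec (an illustrative figure — benchmark your own consumers before committing to a partition count):

```python
import math

def partitions_needed(peak_events_per_sec: float,
                      per_partition_drain_rate: float,
                      headroom: float = 2.0) -> int:
    """Smallest partition count whose consumers keep up at peak, with headroom."""
    return math.ceil(peak_events_per_sec * headroom / per_partition_drain_rate)

# Verdict's target: 500 events/sec, sized at 2x for burst capacity.
# 150 events/sec per partition is an assumed drain rate, not a measurement.
print(partitions_needed(500, 150))  # 7 -- the 8-partition starting point covers it
```

Starting at 8 partitions leaves one partition of slack at the assumed drain rate; note that expanding the partition count later changes which partition a given key maps to, which is one reason the verdict starts small and grows only on demand.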

Next actions

Deploy Confluent Cloud Basic cluster in existing AWS region, create primary event topic with 8 partitions and 7-day retention
backend · immediate
Build shadow-mode producer that mirrors 10% of live webhook traffic to Kafka topic; measure P50/P95/P99 end-to-end latency over 72 hours
backend · immediate
Audit all downstream webhook consumers for hard latency SLA requirements — document which endpoints require sub-100ms delivery vs. which tolerate seconds of delay
backend · immediate
Build consumer service that reads from Kafka and delivers webhooks with retry/DLQ logic, replacing direct fan-out
backend · before_launch
Set up alerting on consumer lag, partition rebalance events, and producer error rates in Confluent Cloud dashboard
infra · before_launch
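The retry/DLQ consumer action above can be sketched independently of Kafka itself. A minimal sketch with the HTTP poster and dead-letter sink injected as callables (`post` and `dead_letter` are hypothetical names chosen here so the logic stays testable without a broker or network):

```python
import time

def deliver_with_retry(event: dict, post, dead_letter,
                       max_attempts: int = 4, base_delay: float = 0.5) -> bool:
    """Attempt webhook delivery with exponential backoff; dead-letter on exhaustion.

    post(event) -> bool      True on a successful (2xx) response.
    dead_letter(event, why)  Records the undeliverable event (e.g. a DLQ topic).
    """
    for attempt in range(max_attempts):
        if post(event):
            return True
        time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    dead_letter(event, f"undeliverable after {max_attempts} attempts")
    return False
```

In the real consumer service, `post` would wrap the subscriber's endpoint and `dead_letter` would produce to a dedicated DLQ topic; committing the Kafka offset only after `deliver_with_retry` returns preserves at-least-once delivery.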
This verdict stops being true when
If downstream webhook consumers have hard sub-50ms delivery SLAs that cannot be renegotiated, and the PoC validates that Kafka adds >50ms latency → Use SQS+SNS for async fan-out (lower latency, simpler ops) or keep direct webhooks with horizontal scaling of dispatcher nodes behind a load balancer
If growth rate slows and 500 events/sec target is revised down to <200 events/sec, and team lacks Kafka expertise → Use AWS EventBridge as serverless event router — sufficient throughput at that scale with near-zero operational overhead for 8 engineers
If the team cannot staff 2+ engineers for Kafka migration without halting feature development → Implement SQS+SNS fan-out as a simpler intermediate step that buys 12-18 months of headroom before needing Kafka

Council notes

  • Vulcan: ALTERNATIVE C) SQS + SNS
  • Loki: Webhook scaling is akin to managing a fleet of taxis dispatched directly by phone calls (current setup): fine at 50/s...

Assumptions

  • Current webhook architecture is synchronous fan-out where the producer directly calls each subscriber endpoint, creating O(subscribers × events) HTTP calls
  • The 8-engineer team can allocate 2-3 engineers to the Kafka migration while maintaining feature velocity with the remaining 5-6
  • AWS infrastructure and networking costs are not the binding constraint — the ~$12k/year Confluent Cloud cost is within budget
  • The 3x annual growth rate continues beyond the 500 events/sec target, meaning that once that target is reached the system must handle ~1,500 events/sec a year later and ~4,500 events/sec two years later
  • Downstream webhook consumers can tolerate 5-50ms additional latency introduced by Kafka buffering
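The growth figures in these assumptions compound from the 500 events/sec migration target, not from today's 50 events/sec. A quick check of the stated numbers (base and rate taken from the assumptions above):

```python
def projected_rate(base: float, years: float, annual_growth: float = 3.0) -> float:
    """Events/sec after `years` of compounding growth at `annual_growth`x per year."""
    return base * annual_growth ** years

# From the 500 events/sec target, at 3x/year:
print(round(projected_rate(500, 1)))  # 1500
print(round(projected_rate(500, 2)))  # 4500
```

Note that the 10x jump from 50 to 500 events/sec inside 6 months is a migration target, faster than the 3x/year organic trend; the compounding projection applies from that point onward.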

Operational signals to watch

reversal — If downstream webhook consumers have hard sub-50ms delivery SLAs that cannot be renegotiated, and the PoC validates that Kafka adds >50ms latency
reversal — If growth rate slows and 500 events/sec target is revised down to <200 events/sec, and team lacks Kafka expertise
reversal — If the team cannot staff 2+ engineers for Kafka migration without halting feature development

Unresolved uncertainty

  • Whether the 8-engineer team has sufficient Kafka expertise to execute migration in 6 months — Confluent Cloud reduces but does not eliminate the learning curve
  • b003's latency concern is valid: Kafka buffering may add 5-50ms latency depending on configuration, and it's unclear whether downstream webhook consumers have hard sub-100ms SLA requirements that would be violated
  • b004 and b005 were empty branches at 0.50 confidence — unclear what positions they would have represented, leaving potential alternatives unexplored
  • The $12,000/year cost estimate is synthetic — actual Confluent Cloud pricing depends on throughput, retention, and connector usage that weren't specified
  • Whether EventBridge might actually be sufficient at this scale (500 events/sec is well within its limits) with lower operational burden — this alternative was not thoroughly stress-tested in debate

Branch battle map

[Branch map: rounds R1-R3; branches b001-b006; censor reopen]
Battle timeline (3 rounds)
Round 1 — Initial positions · 1 branch
Branch b001 (Vulcan) eliminated — outperformed by rival branch
Round 2 — Adversarial probes · 3 branches
Loki proposed branch b003
Socrates proposed branch b004
Loki: Kafka's durability and throughput come at the cost of added latency (millisecond…
Round 3 — Final convergence · 4 branches
Socrates proposed branch b005
Loki proposed branch b006
Loki: Webhook scaling is akin to managing a fleet of taxis dispatched directly by phon…