We have a 200K-line Django monolith serving 15K RPM. Which service should we extract first? The app has payments, notifications, user profiles, search, and an admin dashboard. Team is 6 backend engineers, we're on AWS with RDS Postgres, and downtime during extraction is not acceptable.
Confidence: 73%
Risk: unknown
Extract the notifications service first using AWS SNS/SQS as the decoupling layer
Decision
- Extract notifications first using AWS SNS/SQS with the strangler fig pattern. Notifications are structurally the safest first extraction because they are a leaf node: they depend on other services, but nothing depends on them. If the notification service fails, payments still process and search still works.
- Deploy a new service (FastAPI or standalone Django) on ECS Fargate behind an internal ALB. The monolith publishes notification events to SQS; the new service consumes them and dispatches via SES/SNS/Twilio. Start with ONE notification type (e.g., email confirmations) to validate the pattern.
- Keep the existing notification tables in RDS Postgres initially, with read-only access through a separate connection pool; migrate to a dedicated RDS instance in a later phase.
- Route via ALB path-based rules with a feature-flag fallback to the monolith's notification path.
- Critical: implement idempotency using a hash of notification_id + timestamp to handle SQS at-least-once delivery. Add dead-letter queues to capture messages that fail delivery or repeatedly fail processing.
- Timeline: weeks 1-4 design + SQS integration, weeks 5-12 incremental traffic shift.
- This extraction's primary value is not performance improvement; it is establishing a proven extraction pattern and team muscle memory for the harder extractions (payments, search) that follow.
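The idempotency requirement above can be sketched as follows. This is a minimal illustration, not the report's implementation: the message fields (`notification_id`, `created_at`), the in-memory store, and the `send` callable are all assumptions. In particular, the timestamp is assumed to be set once by the publisher, since a receive-time timestamp would change on every redelivery and defeat deduplication.

```python
import hashlib


def idempotency_key(notification_id, created_at):
    """Hash notification_id plus the publisher-assigned creation timestamp."""
    raw = f"{notification_id}:{created_at}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class InMemoryDedupStore:
    """Hypothetical stand-in for a durable store with a unique-key constraint."""

    def __init__(self):
        self._seen = set()

    def claim(self, key):
        """Return True the first time a key is claimed, False afterwards."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def release(self, key):
        """Forget a key so a failed send can be retried on redelivery."""
        self._seen.discard(key)


def handle_message(message, store, send):
    """Dispatch a notification once per idempotency key.

    Returns True if `send` ran, False if the message was a duplicate.
    """
    key = idempotency_key(message["notification_id"], message["created_at"])
    if not store.claim(key):
        return False  # duplicate delivery: acknowledge and skip
    try:
        send(message)
    except Exception:
        store.release(key)  # let the queue redeliver and retry
        raise
    return True
```

A production version would need the claim to be atomic across consumer tasks, for example a database insert that fails on a duplicate key; the in-memory set here only demonstrates the control flow.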
Next actions
- Write the service extraction RFC covering SNS topic and SQS queue design, ALB routing rules, idempotency schema, and rollback procedure
- Run pg_stat_statements analysis on the monolith's top 20 queries to quantify current database contention and validate that notification queries are not in the critical path
- Set up an SQS queue with a dead-letter queue and a minimal consumer service on ECS Fargate that processes one notification type (e.g., email confirmation) in shadow mode alongside the monolith
- Configure CloudWatch dashboards tracking SQS message age, DLQ depth, notification delivery success rate, and monolith notification endpoint latency as the baseline before traffic shift
- Run a load test at 2x current notification volume against the new service in shadow mode to verify ECS Fargate task sizing and SQS consumer throughput before any traffic shift
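As a rough sketch of the "SQS queue with dead-letter queue" step, the helper below creates a queue pair and wires them together through the `RedrivePolicy` queue attribute. The queue name, the `-dlq` suffix, and the `maxReceiveCount` of 5 are illustrative assumptions; `sqs` is expected to be a boto3 SQS client (`boto3.client("sqs")`), which the caller supplies.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes dict that attaches a dead-letter queue to a queue."""
    return {
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": str(max_receive_count),
            }
        )
    }


def create_queue_with_dlq(sqs, name):
    """Create `<name>-dlq` and `<name>`, returning the main queue URL."""
    # The DLQ must exist first so its ARN can be referenced by the main queue.
    dlq_url = sqs.create_queue(QueueName=f"{name}-dlq")["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    main = sqs.create_queue(QueueName=name, Attributes=redrive_attributes(dlq_arn))
    return main["QueueUrl"]
```

With this policy, a message that has been received `maxReceiveCount` times without being deleted is moved to the DLQ, so DLQ depth becomes a direct signal of consumer failures.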
This verdict stops being true when
- pg_stat_statements analysis reveals that notification-related queries are among the top 5 contention sources and are transactionally coupled to payment or profile writes → Extract notifications but prioritize database decoupling (separate schema/RDS instance) as phase 1 rather than SQS-first, or reconsider extraction order
- The actual driver for extraction is deployment velocity (teams blocking each other on releases) rather than scalability or modularity → Invest in monolith modularization (Django app boundaries, separate test suites, trunk-based development with feature flags), which delivers deployment independence faster than service extraction
- Team analysis reveals search is consuming >40% of database resources and growing, while notifications are negligible → Extract search first using Elasticsearch with a CDC pipeline, accepting the longer timeline but addressing the actual bottleneck
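Two of these flip conditions can be checked mechanically once pg_stat_statements data is exported. The sketch below assumes rows of `(query_text, total_exec_time_ms)` (the column is `total_exec_time` in PostgreSQL 13 and later) and hypothetical table-name prefixes per domain, following Django's default `<app_label>_<model>` naming. Total execution time is used as a proxy for "database resources"; it does not measure lock contention, and transactional coupling cannot be detected this way.

```python
from collections import defaultdict

# Hypothetical prefixes; replace with the monolith's real app labels.
DOMAIN_PREFIXES = {
    "notifications": "notifications_",
    "search": "search_",
    "payments": "payments_",
}


def domain_of(query):
    """Attribute a query to the first domain whose table prefix it mentions."""
    lowered = query.lower()
    for domain, prefix in DOMAIN_PREFIXES.items():
        if prefix in lowered:
            return domain
    return "other"


def time_share_by_domain(rows):
    """Return each domain's fraction of total execution time."""
    totals = defaultdict(float)
    for query, total_ms in rows:
        totals[domain_of(query)] += total_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {domain: ms / grand_total for domain, ms in totals.items()}


def flip_signals(rows, top_n=5, search_threshold=0.40):
    """Evaluate the two measurable flip conditions from the exported rows."""
    shares = time_share_by_domain(rows)
    top = sorted(rows, key=lambda row: row[1], reverse=True)[:top_n]
    return {
        "notifications_in_top_n": any(
            domain_of(query) == "notifications" for query, _ in top
        ),
        "search_over_threshold": shares.get("search", 0.0) > search_threshold,
    }
```

Queries that join tables from several domains are attributed to the first matching prefix only, so the output is a screening signal to prompt closer inspection rather than a verdict.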
Council notes
Socrates
Vulcan
Daedalus
Loki
Assumptions
- Notifications in this Django monolith follow the typical pattern of being event consumers (triggered after payments/profile changes complete), not transactional co-participants sharing ACID boundaries with other domains
- The team has or can quickly acquire operational familiarity with AWS SQS, ECS Fargate, and ALB routing — these are standard AWS services but new operational surface area
- Zero-downtime extraction is achievable via the strangler fig pattern with ALB routing and feature flags, meaning the monolith's notification path can remain active as a fallback during the entire migration
- The 6-engineer team can sustain feature development while 2-3 engineers focus on extraction — if the team is already at capacity, extraction velocity drops significantly
- RDS Postgres at 15K RPM has sufficient connection headroom for a new service's separate connection pool without requiring pgBouncer or connection pool infrastructure changes
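The zero-downtime assumption rests on the monolith's notification path staying available as a fallback. A minimal sketch of that routing decision is shown below; `flag_enabled`, `publish_to_queue`, and `legacy_send` are hypothetical callables standing in for the feature-flag lookup, the SQS publish, and the existing in-monolith send.

```python
def send_notification(event, *, flag_enabled, publish_to_queue, legacy_send):
    """Route one notification through the new path, falling back to the monolith.

    Returns "queue" or "legacy" to indicate which path handled the event.
    """
    if flag_enabled(event["type"]):
        try:
            publish_to_queue(event)
            return "queue"
        except Exception:
            # Publish failed: fall through so the notification still goes out.
            pass
    legacy_send(event)
    return "legacy"
```

Enabling the flag for a single notification type matches the "start with one type" rollout, and turning it off is the rollback. One caveat under these assumptions: if a publish times out after the queue has already accepted the message, the fallback can produce a duplicate that consumer-side deduplication will not catch, so the fallback rate is worth monitoring.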
Operational signals to watch
- reversal: pg_stat_statements analysis reveals that notification-related queries are among the top 5 contention sources and are transactionally coupled to payment or profile writes
- reversal: The actual driver for extraction is deployment velocity (teams blocking each other on releases) rather than scalability or modularity
- reversal: Team analysis reveals search is consuming >40% of database resources and growing, while notifications are negligible
Unresolved uncertainty
- Whether the monolith's actual bottleneck is database contention, deployment coupling, or team coordination — this affects whether extraction delivers meaningful value or is premature optimization of architecture
- The actual notification volume as a percentage of 15K RPM is unknown — if notifications are only 1-2% of traffic, the extraction exercises the pattern but provides negligible operational relief
- b006's reframe (optimize monolith first) was not fully defeated — at 15K RPM, a well-tuned Django monolith on RDS may have years of headroom before extraction becomes necessary
- Whether the team has prior experience operating distributed systems and message queues — if not, the operational overhead of SQS + ECS Fargate + monitoring + alerting could exceed the 6-engineer team's capacity
- The killed b003 branch contained more detailed implementation specifics (ECS Fargate sizing, DynamoDB for idempotency, weighted target groups) than the surviving b002 — the strongest tactical guidance was lost when b003 was killed
Branch battle map
Battle timeline (3 rounds)
Round 1 — Initial positions · 3 branches
- Branch b001 (Socrates) eliminated: Branch b001 proposes extracting user profiles first, but ...
- Socrates proposed branch b004: Before any service extraction, conduct a 2-week system analysis to identify true…
Round 2 — Adversarial probes · 2 branches
- Branch b003 (Daedalus) eliminated: The dominant position (b003) assumes that notifications a...
- Loki proposed branch b005: Notifications extraction (b002/b003) assumes low coupling, but they're tightly b…
- Branch b005 (Loki) eliminated: b005 claims notifications are 'tightly bound to user prof...
Round 3 — Final convergence · 3 branches
- Socrates proposed branch b006: Challenge the premise that extraction is necessary. Instead, focus on optimizing…