We have a 200K-line Django monolith serving 15K RPM. Which service should we extract first? The app has payments, notifications, user profiles, search, and an admin dashboard. Team is 6 backend engineers, we're on AWS with RDS Postgres, and downtime during extraction is not acceptable.

accepted_conditional · Software architecture · software_operational · Pro · 487s · $0.56

6 branches explored · 3 survived · 3 rounds · integrity 75%

Risk unknown 487s

Verdict

Extract the notifications service first using AWS SNS/SQS as the decoupling layer

Decision: 73%
Execution: -
Uncertainty: -

Decision

Concrete components, topology, and thresholds named below are candidate mitigations or example implementations inferred by the Council. They were not confirmed in your filing or established as part of your current environment.

Extract notifications first using AWS SNS/SQS with the strangler fig pattern. Notifications are structurally the safest first extraction because they are a leaf node: they depend on other services but nothing depends on them. If the notification service fails, payments still process and search still works. Deploy a new service (FastAPI or standalone Django) on ECS Fargate behind an internal ALB. The monolith publishes notification events to SQS; the new service consumes and dispatches via SES/SNS/Twilio. Start with one notification type (e.g., email confirmations) to validate the pattern. Keep existing notification tables in RDS Postgres initially with read-only access via a separate connection pool; migrate to dedicated RDS in phase 2. Route via ALB path-based rules with feature flag fallback. Critical: implement idempotency using notification_id + timestamp hash to handle SQS at-least-once delivery. Add dead-letter queues for messages that repeatedly fail processing. Timeline: weeks 1-4 design + SQS integration, weeks 5-12 incremental traffic shift. This extraction's primary value is not performance improvement. It is establishing a proven extraction pattern and team muscle memory for the harder extractions (payments, search) that follow.
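The idempotency step can be sketched as follows. This is a minimal illustration, not the Council's implementation: the field names `notification_id` and `created_at`, the `InMemoryDedupStore` class, and the `handle_message` helper are all hypothetical. It assumes the timestamp is assigned once by the publisher, because a timestamp taken at receive time would differ on every redelivery and defeat the key.

```python
import hashlib


def idempotency_key(notification_id: str, created_at: str) -> str:
    """Hash notification_id plus the publisher-assigned timestamp into one key."""
    raw = f"{notification_id}|{created_at}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class InMemoryDedupStore:
    """Illustrative stand-in for a durable store shared by all consumers."""

    def __init__(self):
        self._seen = set()

    def claim(self, key: str) -> bool:
        """Return True the first time a key is seen, False on every repeat."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def handle_message(message: dict, store: InMemoryDedupStore, send) -> bool:
    """Dispatch a notification once; skip redelivered copies."""
    key = idempotency_key(message["notification_id"], message["created_at"])
    if not store.claim(key):
        return False  # duplicate delivery, already handled
    send(message)
    return True
```

In a real consumer the store would need to be durable and shared across tasks (for example a table with a unique constraint on the key). Claiming before sending trades a possible lost notification on a crash for never sending twice; the opposite order makes the opposite trade.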

Inferred specifics

Structured audit rows for Council-added details. Synthetic basis means the detail was introduced by analysis, not supplied by the filing.

Value	Kind	Basis	Where introduced
migrate to dedicated RDS in phase 2	estimate	synthetic	chosen_path
Timeline: weeks 1-4 design + SQS integration	estimate	synthetic	chosen_path
weeks 5-12 incremental traffic shift	estimate	synthetic	chosen_path
1	estimate	synthetic	next_action
2	estimate	synthetic	next_action
3	estimate	synthetic	next_action
4	estimate	synthetic	next_action
5	estimate	synthetic	next_action
0.80	estimate	synthetic	selection_rationale
0.75	estimate	synthetic	selection_rationale
0.65	estimate	synthetic	selection_rationale
b002 survived 3 rounds of adversarial scrutiny	estimate	synthetic	selection_rationale
Killed in round 1	estimate	synthetic	rejected_alternatives.rationale
modules within 8 weeks	estimate	synthetic	rejected_alternatives.rationale
Conduct 2-week system analysis before any	estimate	synthetic	rejected_alternatives.path
at 0.65	version	synthetic	rejected_alternatives.rationale
at 0.75	version	synthetic	rejected_alternatives.rationale
Killed in round 2	estimate	synthetic	rejected_alternatives.rationale
is a 3-6 month project versus the	estimate	synthetic	rejected_alternatives.rationale
monolith's top 20 queries to quantify current	estimate	synthetic	structured_next_actions.description

Highest-probability failure mode: not computed - insufficient evidence in filing to identify with confidence.

Next actions

Write the service extraction RFC with SQS topic design, ALB routing rules, idempotency schema, and rollback procedure

backend · immediate

Candidate estimate (inferred, not source-confirmed): Run pg_stat_statements analysis on the monolith's top 20 queries to quantify current database contention and validate that notification queries are not in the critical path

backend · immediate
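One way to read the pg_stat_statements output is sketched below. It assumes PostgreSQL 13 or later, where the view exposes `total_exec_time` (older versions call it `total_time`), and it assumes notification queries can be recognized by a substring in the query text; the `marker` value and the `summarize` helper are hypothetical. The view reports cumulative execution time rather than lock waits, so this is a proxy for contention, not a direct measurement.

```python
# Query to run against the monolith's database (shown for reference only;
# this sketch does not open a connection).
TOP_QUERIES_SQL = """
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20
"""


def summarize(rows, marker="notification"):
    """Rank rows by total_exec_time and report where notification queries fall.

    rows: iterable of dicts with 'query' and 'total_exec_time' keys.
    marker: hypothetical substring identifying notification-related queries.
    """
    ranked = sorted(rows, key=lambda r: r["total_exec_time"], reverse=True)
    total = sum(r["total_exec_time"] for r in ranked)
    hits = [
        (rank, r)
        for rank, r in enumerate(ranked, start=1)
        if marker in r["query"].lower()
    ]
    share = sum(r["total_exec_time"] for _, r in hits) / total if total else 0.0
    return {
        "notification_ranks": [rank for rank, _ in hits],
        "notification_share": share,
    }
```

If any notification query ranks in the top 5, that is the signal to check whether it shares a transaction with payment or profile writes before going further.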

Set up an SQS queue with dead-letter queue and a minimal consumer service on ECS Fargate that processes one notification type (e.g., email confirmation) in shadow mode alongside the monolith

backend · immediate
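A minimal sketch of the queue setup, assuming a boto3 SQS client is passed in as `sqs`. The queue names, the `max_receive_count` of 5, and the 60-second visibility timeout are placeholder values, not recommendations from the filing. The dead-letter queue is attached through the main queue's `RedrivePolicy` attribute, so messages move there after they have been received `maxReceiveCount` times without being deleted.

```python
import json


def main_queue_attributes(dlq_arn, max_receive_count=5, visibility_timeout_s=60):
    """Build SQS attributes that attach a dead-letter queue to the main queue."""
    return {
        # SQS attribute values are passed as strings.
        "VisibilityTimeout": str(visibility_timeout_s),
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": str(max_receive_count),
            }
        ),
    }


def create_queues(sqs, name):
    """Create '<name>-dlq' first, then '<name>' pointing at it.

    sqs: a boto3 SQS client (or any object with the same two methods).
    """
    dlq_url = sqs.create_queue(QueueName=f"{name}-dlq")["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    main_url = sqs.create_queue(
        QueueName=name, Attributes=main_queue_attributes(dlq_arn)
    )["QueueUrl"]
    return main_url, dlq_url
```

The visibility timeout should comfortably exceed the consumer's worst-case processing time, otherwise slow messages are redelivered and count toward `maxReceiveCount` even though nothing failed.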

Configure CloudWatch dashboards tracking SQS message age, DLQ depth, notification delivery success rate, and monolith notification endpoint latency as the baseline before traffic shift

infra · before_launch
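The two SQS signals in this action map to standard `AWS/SQS` metrics: `ApproximateAgeOfOldestMessage` for message age and `ApproximateNumberOfMessagesVisible` on the dead-letter queue for DLQ depth. The sketch below builds a dashboard body for those two only. The queue names, region, and widget layout are assumptions, and delivery success rate and endpoint latency are left out because they would come from custom or application metrics not described in the filing.

```python
import json


def sqs_widget(title, metric, queue_name, stat, region, y):
    """One CloudWatch metric widget for a single SQS queue metric."""
    return {
        "type": "metric",
        "x": 0,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "metrics": [["AWS/SQS", metric, "QueueName", queue_name]],
            "stat": stat,
            "period": 60,
            "region": region,
        },
    }


def baseline_dashboard_body(queue_name, dlq_name, region="us-east-1"):
    """Return the JSON string expected as DashboardBody by put_dashboard."""
    widgets = [
        sqs_widget("Oldest message age (s)", "ApproximateAgeOfOldestMessage",
                   queue_name, "Maximum", region, 0),
        sqs_widget("DLQ depth", "ApproximateNumberOfMessagesVisible",
                   dlq_name, "Maximum", region, 6),
    ]
    return json.dumps({"widgets": widgets})
```

With a boto3 CloudWatch client, the result would be passed as `cloudwatch.put_dashboard(DashboardName=..., DashboardBody=baseline_dashboard_body(...))`.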

Candidate estimate (inferred, not source-confirmed): Run load test at 2x current notification volume against the new service in shadow mode to verify ECS Fargate task sizing and SQS consumer throughput before any traffic shift

backend · before_launch
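Before running the load test, a rough consumer count can be estimated with Little's law: the average number of messages in service equals arrival rate times processing time. The sketch below is an estimate under stated assumptions, not a sizing result. The notification rate is unknown in the filing, so every number in the usage example is hypothetical, as is the 0.7 target utilization.

```python
import math


def required_consumers(messages_per_second, seconds_per_message,
                       concurrency_per_task=1, target_utilization=0.7):
    """Estimate how many consumer tasks keep up with a given message rate.

    messages_per_second * seconds_per_message is the average number of
    messages being processed at once (Little's law). Dividing by the
    usable capacity of one task gives the task count, rounded up.
    """
    if min(messages_per_second, seconds_per_message,
           concurrency_per_task, target_utilization) <= 0:
        raise ValueError("all inputs must be positive")
    in_service = messages_per_second * seconds_per_message
    return math.ceil(in_service / (concurrency_per_task * target_utilization))


# Hypothetical example: a measured baseline of 10 messages/s, tested at 2x,
# with each message taking 0.2 s to dispatch.
tasks_at_2x = required_consumers(messages_per_second=2 * 10, seconds_per_message=0.2)
```

The load test then checks whether SQS message age stays flat at that task count; if it grows, the per-message time or concurrency assumption was wrong.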

Verdict-to-Work

A model gives you advice. VectorCourt turns the verdict into accountable work.

### pg_stat_statements analysis reveals that notification-related queries are among the top 5 contention sources and are transactionally coup...

- Finding ID: `reversal_condition:1_pg_stat_statements_analysis_reveals_that_notification-related_queries_are_among_the_top_5_contention_source`
- Subtype: `reversal_condition`
- Evidence status: `observed`
- Default work type: `investigation_wo`
- Summary: Extract notifications but prioritize database decoupling (separate schema/RDS instance) as phase 1 rather than SQS-first, or reconsider extraction order
- Evidence boundary: condition flips verdict when observed
- Reversal condition: pg_stat_statements analysis reveals that notification-related queries are among the top 5 contention sources and are transactionally coupled to payment or profile writes

Acceptance criteria:
- Root cause or measurement plan is identified for the reversal condition.
- Evidence status remains marked synthetic until measured.
- Follow-up implementation work is created only after evidence is observed.

### The actual driver for extraction is deployment velocity (teams blocking each other on releases) rather than scalability or modularity

- Finding ID: `reversal_condition:2_the_actual_driver_for_extraction_is_deployment_velocity__teams_blocking_each_other_on_releases__rather_than`
- Subtype: `reversal_condition`
- Evidence status: `observed`
- Default work type: `investigation_wo`
- Summary: Invest in monolith modularization (Django app boundaries, separate test suites, trunk-based development with feature flags) which delivers deployment independence faster than service extraction
- Evidence boundary: condition flips verdict when observed
- Reversal condition: The actual driver for extraction is deployment velocity (teams blocking each other on releases) rather than scalability or modularity

Acceptance criteria:
- Root cause or measurement plan is identified for the reversal condition.
- Evidence status remains marked synthetic until measured.
- Follow-up implementation work is created only after evidence is observed.

### Team analysis reveals search is consuming >40% of database resources and growing, while notifications are negligible

- Finding ID: `reversal_condition:3_team_analysis_reveals_search_is_consuming__40__of_database_resources_and_growing__while_notifications_are_n`
- Subtype: `reversal_condition`
- Evidence status: `observed`
- Default work type: `investigation_wo`
- Summary: Extract search first using Elasticsearch with CDC pipeline, accepting the longer timeline but addressing the actual bottleneck
- Evidence boundary: condition flips verdict when observed
- Reversal condition: Team analysis reveals search is consuming >40% of database resources and growing, while notifications are negligible

Acceptance criteria:
- Root cause or measurement plan is identified for the reversal condition.
- Evidence status remains marked synthetic until measured.
- Follow-up implementation work is created only after evidence is observed.

### Write the service extraction RFC with SQS topic design, ALB routing rules, idempotency schema, and rollback procedure

- Finding ID: `repair_action:1_write_the_service_extraction_rfc_with_sqs_topic_design__alb_routing_rules__idempotency_schema__and_rollback_proc`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: implement
- Evidence boundary: backend

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

### Run pg_stat_statements analysis on the monolith's top 20 queries to quantify current database contention and validate that notification q...

- Finding ID: `repair_action:2_run_pg_stat_statements_analysis_on_the_monolith_s_top_20_queries_to_quantify_current_database_contention_and_val`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: investigate
- Evidence boundary: backend

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

### Set up an SQS queue with dead-letter queue and a minimal consumer service on ECS Fargate that processes one notification type (e.g., emai...

- Finding ID: `repair_action:3_set_up_an_sqs_queue_with_dead-letter_queue_and_a_minimal_consumer_service_on_ecs_fargate_that_processes_one_noti`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: implement
- Evidence boundary: backend

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

### Configure CloudWatch dashboards tracking SQS message age, DLQ depth, notification delivery success rate, and monolith notification endpoi...

- Finding ID: `repair_action:4_configure_cloudwatch_dashboards_tracking_sqs_message_age__dlq_depth__notification_delivery_success_rate__and_mon`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: monitor
- Evidence boundary: infra

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

### Run load test at 2x current notification volume against the new service in shadow mode to verify ECS Fargate task sizing and SQS consumer...

- Finding ID: `repair_action:5_run_load_test_at_2x_current_notification_volume_against_the_new_service_in_shadow_mode_to_verify_ecs_fargate_tas`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: validate
- Evidence boundary: backend

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

This verdict stops being true when

Candidate estimate (inferred, not source-confirmed): pg_stat_statements analysis reveals that notification-related queries are among the top 5 contention sources and are transactionally coupled to payment or profile writes → Candidate estimate (inferred, not source-confirmed): Extract notifications but prioritize database decoupling (separate schema/RDS instance) as phase 1 rather than SQS-first, or reconsider extraction order

The actual driver for extraction is deployment velocity (teams blocking each other on releases) rather than scalability or modularity → Invest in monolith modularization (Django app boundaries, separate test suites, trunk-based development with feature flags) which delivers deployment independence faster than service extraction

Candidate estimate (inferred, not source-confirmed): Team analysis reveals search is consuming >40% of database resources and growing, while notifications are negligible → Extract search first using Elasticsearch with CDC pipeline, accepting the longer timeline but addressing the actual bottleneck

Council notes

Socrates

Challenge the premise that extraction is necessary. Instead, focus on optimizing the monolith through incremental imp...

Vulcan

Pursue notifications extraction first with an event-driven architecture based on a pub-sub system like AWS SNS/SQS, s...

Daedalus

Extract notifications service first (Alternative B). Concrete architecture: Deploy a new Python FastAPI service (or D...

Loki

Notifications extraction (b002/b003) assumes low coupling, but they're tightly bound to user profiles and payments vi...

Evidence boundary

Observed from your filing

We have a 200K-line Django monolith serving 15K RPM. Which service should we extract first? The app has payments, notifications, user profiles, search, and an admin dashboard. Team is 6 backend engineers, we're on AWS with RDS Postgres, and downtime during extraction is not acceptable.

Assumptions used for analysis

Notifications in this Django monolith follow the typical pattern of being event consumers (triggered after payments/profile changes complete), not transactional co-participants sharing ACID boundaries with other domains
The team has or can quickly acquire operational familiarity with AWS SQS, ECS Fargate, and ALB routing — these are standard AWS services but new operational surface area
Zero-downtime extraction is achievable via the strangler fig pattern with ALB routing and feature flags, meaning the monolith's notification path can remain active as a fallback during the entire migration
The 6-engineer team can sustain feature development while 2-3 engineers focus on extraction — if the team is already at capacity, extraction velocity drops significantly
RDS Postgres at 15K RPM has sufficient connection headroom for a new service's separate connection pool without requiring pgBouncer or connection pool infrastructure changes
existing stack synthetic default (not observed): greenfield assumed [synthetic] (not_addressed)
connection pooler not provided: not specified [not_provided] (not_addressed)
deployment model not provided: not specified [not_provided] (not_addressed)
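The strangler fig assumption above depends on being able to shift traffic gradually while keeping the monolith path as a fallback. A minimal sketch of a percentage-based flag check is shown here; the `rollout_bucket` and `use_new_service` names, the 100-bucket scheme, and the kill switch are illustrative assumptions rather than anything specified in the filing.

```python
import hashlib


def rollout_bucket(notification_id: str) -> int:
    """Map an ID to a stable bucket in 0..99 using a hash, not randomness."""
    digest = hashlib.sha256(notification_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_service(notification_id: str, rollout_percent: int,
                    kill_switch: bool = False) -> bool:
    """Decide whether this notification goes to the extracted service.

    kill_switch forces everything back onto the monolith path, which is
    the fallback the strangler fig approach relies on.
    """
    if kill_switch:
        return False
    return rollout_bucket(notification_id) < rollout_percent
```

Because the bucket is derived from the ID, a notification routed to the new service at 10% stays there at 50%, which keeps retries and redeliveries on one path as the percentage rises.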

Inferred candidate specifics

These details were introduced by the Council during analysis. They were not supplied in your filing.

Chosen path: Extract notifications first using AWS SNS/SQS with the strangler fig pattern. Notifications are structurally the safest first extraction because they are a leaf node: they depend on other services but nothing depends on them. If the notification service fails, payments still process and search still works. Deploy a new service (FastAPI or standalone Django) on ECS Fargate behind an internal ALB. The monolith publishes notification events to SQS; the new service consumes and dispatches via SES/SNS/Twilio. Start with one notification type (e.g., email confirmations) to validate the pattern. Keep existing notification tables in RDS Postgres initially with read-only access via a separate connection pool; migrate to dedicated RDS in phase 2. Route via ALB path-based rules with feature flag fallback. Critical: implement idempotency using notification_id + timestamp hash to handle SQS at-least-once delivery. Add dead-letter queues for messages that repeatedly fail processing. Timeline: weeks 1-4 design + SQS integration, weeks 5-12 incremental traffic shift. This extraction's primary value is not performance improvement. It is establishing a proven extraction pattern and team muscle memory for the harder extractions (payments, search) that follow.

Next action: Create a service extraction RFC documenting: (1) the SQS topic structure for notification events, (2) the specific notification type to extract first, (3) the ALB path-based routing rules for the strangler fig pattern, (4) the idempotency key schema, and (5) the rollback procedure if the new service fails. Circulate to all 6 engineers for review within 5 business days.

Selection rationale: b002 won with 0.80 confidence as the highest-confidence surviving branch with a concrete, actionable extraction recommendation. b006 (0.75) was a reframe challenging the extraction premise: valid as a strategic consideration, but it does not answer the extraction question. b004 (0.65) proposed analysis-first, which delays action without structural justification, since the notifications-first argument is based on dependency topology (leaf node), not performance profiling. b002 survived 3 rounds of adversarial scrutiny including attacks on notification coupling (b005, killed) and database contention concerns (b004/b006). The leaf-node argument was never successfully refuted.

Rejected alternative: Extract user profiles first with read/write splitting and logical replication
Rationale: Killed in round 1. User profiles are the most coupled entity; virtually every request touches user data. Extracting the hub node first forces solving the hardest distributed data problem across ALL modules simultaneously. At 15K RPM, adding a network hop to nearly every request path cascades latency. 6 engineers cannot refactor all dependent modules within 8 weeks.

Rejected alternative: Conduct 2-week system analysis before any extraction to identify true bottlenecks
Rationale: b004 at 0.65 confidence. Valid concern but delays action without guaranteeing different conclusions. The architectural argument for notifications-first is structural (leaf node, contained blast radius), not performance-based, so it holds regardless of what bottleneck analysis reveals. However, the core insight about database contention is worth integrating: run query profiling in parallel with extraction planning, not as a blocking prerequisite.

Rationale: b006 at 0.75 confidence: a strong reframe. Correctly challenges whether extraction is necessary at 15K RPM. However, the question presupposes extraction is the goal, and this branch provides no actionable extraction guidance. Its insights (optimize first) should be treated as a parallel workstream, not a replacement. If the team's actual problem is deployment velocity or team autonomy rather than performance, monolith optimization doesn't solve it.

Unknowns blocking a firmer verdict

Whether the monolith's actual bottleneck is database contention, deployment coupling, or team coordination — this affects whether extraction delivers meaningful value or is premature optimization of architecture
The actual notification volume as a percentage of 15K RPM is unknown — if notifications are only 1-2% of traffic, the extraction exercises the pattern but provides negligible operational relief
b006's reframe (optimize monolith first) was not fully defeated — at 15K RPM, a well-tuned Django monolith on RDS may have years of headroom before extraction becomes necessary
Whether the team has prior experience operating distributed systems and message queues — if not, the operational overhead of SQS + ECS Fargate + monitoring + alerting could exceed the 6-engineer team's capacity
The killed b003 branch contained more detailed implementation specifics (ECS Fargate sizing, DynamoDB for idempotency, weighted target groups) than the surviving b002 — the strongest tactical guidance was lost when b003 was killed

Operational signals to watch

reversal — Candidate estimate (inferred, not source-confirmed): pg_stat_statements analysis reveals that notification-related queries are among the top 5 contention sources and are transactionally coupled to payment or profile writes

reversal — The actual driver for extraction is deployment velocity (teams blocking each other on releases) rather than scalability or modularity

reversal — Candidate estimate (inferred, not source-confirmed): Team analysis reveals search is consuming >40% of database resources and growing, while notifications are negligible

Branch battle map

Battle timeline (3 rounds)

Round 1 — Initial positions · 3 branches

Branch b001 (Socrates) eliminated — Branch b001 proposes extracting user profiles first, but ...

Socrates proposed branch b004

Socrates: Before any service extraction, conduct a 2-week system analysis to identify true…

Round 2 — Adversarial probes · 2 branches

Branch b003 (Daedalus) eliminated — The dominant position (b003) assumes that notifications a...

Loki proposed branch b005

Branch b005 (Loki) eliminated — b005 claims notifications are 'tightly bound to user prof...

Loki: Notifications extraction (b002/b003) assumes low coupling, but they're tightly b…

Round 3 — Final convergence · 3 branches

Socrates proposed branch b006

Socrates: Challenge the premise that extraction is necessary. Instead, focus on optimizing…

Evidence source proof

evidence source proof not available for legacy verdicts pre-2026-05-20

Council chamber

Socrates: Analyst
Vulcan: Engineer
Daedalus: Architect
Loki: Disruptor

655fdf62-f165-4c7e-a8ca-b210253d24eb · Protocol

Council archetypes represent independent reasoning perspectives. They are not individuals but structured reasoning roles.

VectorCourt processes filings through approved AI providers; per-verdict model routing is disclosed in Enterprise audit exports.

This verdict is a structured reasoning artifact, not professional advice. VectorCourt does not provide legal, financial, medical, or other professional advice. You are responsible for your own decisions.
