Should we migrate from DynamoDB to PostgreSQL with Citus for a multi-tenant SaaS with 2,000 tenants and 50ms p99 latency?

accepted_conditional · Pro · 646s · $0.86

This verdict assumes 40% of constraints

The following constraints were not provided and default values were used:

7 branches explored · 3 survived · 3 rounds · integrity 75%
Confidence: 72% · Risk: unknown

Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL) with 1 coordinator (8 vCores)

Decision

Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Cosmos DB for PostgreSQL): 1 coordinator (8 vCores, 32GB RAM) + 4 worker nodes (4 vCores, 32GB RAM each). Use tenant_id as the distribution column with co-location. Estimated cost is ~$4,200/month vs $28K+/month for DynamoDB.

Latency: single-tenant queries (90%+ of workload) route to a single shard at 5-15ms p99; cross-tenant JOINs hit 20-45ms p99, which meets the 50ms target with ~10% headroom at the upper bound.

Migration: use pgloader for the bulk load and AWS DMS for CDC during cutover, with a 2-week dual-write period (DynamoDB as read fallback via application-level routing). Neither tool reads DynamoDB directly (pgloader has no DynamoDB source, and AWS DMS supports DynamoDB only as a target), so both steps need an intermediate export.

Critical failure mode: hot tenant skew. If the top 3 tenants represent >40% of data/queries, isolate them onto dedicated worker nodes using Citus tenant isolation (shard_count=1 per large tenant). If skew exceeds 60% on any single worker, p99 will breach 50ms under concurrent load.

Rejected alternative: self-managed Citus on AWS is a hidden budget killer. Dual-running DynamoDB ($28K/month) + self-managed Citus ($8K/month) + engineering blows the budget by month 4.
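
The hot-tenant skew rule in the decision can be expressed as a small check. What follows is a minimal sketch, assuming per-tenant load figures (query counts or bytes) and a tenant-to-worker placement map have already been collected; the function names and input shapes are hypothetical, and only the 40% and 60% thresholds come from the decision text.

```python
from collections import defaultdict

# Thresholds taken from the decision text.
ISOLATE_TOP3_SHARE = 0.40    # top 3 tenants above this share: isolate them
WORKER_BREACH_SHARE = 0.60   # any single worker above this share: p99 at risk


def top_n_share(load_by_tenant, n=3):
    """Fraction of total load carried by the n heaviest tenants."""
    total = sum(load_by_tenant.values())
    if total == 0:
        return 0.0
    heaviest = sorted(load_by_tenant.values(), reverse=True)[:n]
    return sum(heaviest) / total


def worker_shares(load_by_tenant, worker_of_tenant):
    """Fraction of total load landing on each worker node."""
    total = sum(load_by_tenant.values())
    per_worker = defaultdict(float)
    for tenant, load in load_by_tenant.items():
        per_worker[worker_of_tenant[tenant]] += load
    return {w: (load / total if total else 0.0) for w, load in per_worker.items()}


def skew_report(load_by_tenant, worker_of_tenant):
    """Apply both skew thresholds and report what they imply."""
    share = top_n_share(load_by_tenant)
    shares = worker_shares(load_by_tenant, worker_of_tenant)
    return {
        "top3_share": share,
        "isolate_top_tenants": share > ISOLATE_TOP3_SHARE,
        "workers_over_limit": sorted(
            w for w, s in shares.items() if s > WORKER_BREACH_SHARE
        ),
    }
```

The same report can be run once on current DynamoDB traffic and again on the proof-of-concept cluster, so the isolation decision rests on measured shares rather than on the assumed workload profile.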

Next actions

Deploy proof-of-concept Azure Hyperscale Citus cluster (1 coordinator + 1 worker), load top 3 tenants by data volume, replay 24h production query logs, measure p99 latency
backend · immediate
Measure current DynamoDB hot partition distribution: identify top 3 tenants by query volume and data size, calculate skew percentage to determine if tenant isolation will be needed on Citus
data · immediate
Set up pgloader bulk migration pipeline and AWS DMS CDC replication from DynamoDB to Citus staging environment
infra · before_launch
If existing infrastructure is AWS-only, evaluate whether cross-cloud latency to Azure is acceptable or whether self-managed Citus on AWS with extended budget timeline is preferable
infra · immediate
Set up p99 latency alerting at 45ms threshold (5ms buffer) on the Citus coordinator and per-worker node query latency dashboards
infra · before_launch
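
The 45ms alert threshold and the 50ms target can be checked against replayed query latencies with a few lines. This is a minimal sketch, assuming latency samples in milliseconds are already collected from the proof-of-concept replay; the nearest-rank percentile method and the status labels are choices made for illustration, not something the verdict specifies.

```python
import math

P99_TARGET_MS = 50.0       # latency target from the question
ALERT_THRESHOLD_MS = 45.0  # alert threshold from the next actions (5ms buffer)


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_p99(samples_ms):
    """Return the p99 latency and whether it is ok, in the alert band, or a breach."""
    p99 = percentile(samples_ms, 99)
    if p99 > P99_TARGET_MS:
        status = "breach"
    elif p99 > ALERT_THRESHOLD_MS:
        status = "alert"
    else:
        status = "ok"
    return p99, status
```

Running `check_p99` per worker node as well as on the coordinator matches the per-worker dashboards called for above and makes a single overloaded worker visible before it moves the cluster-wide p99.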
This verdict stops being true when
DynamoDB costs are primarily driven by implementation issues (poor partition key design, over-provisioned capacity) and a 30%+ cost reduction is achievable through optimization alone → Optimize existing DynamoDB setup: redesign partition keys, implement auto-scaling, add DAX caching layer, defer migration
Proof-of-concept shows coordinator bottleneck at 2,000 tenants causes p99 > 50ms under production-equivalent concurrent load → Evaluate self-managed Citus on AWS with multiple coordinators, or consider CockroachDB/TiDB as distributed SQL alternatives without single-coordinator constraint
Existing infrastructure is entirely AWS-native and cross-cloud latency to Azure adds >10ms to p99, eating the safety margin → Deploy self-managed Citus on AWS EC2/EKS with increased budget allocation for DBA operational overhead
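
The three reversal conditions above reduce to threshold checks on measured values. A minimal sketch follows, assuming the three inputs have been measured separately (DynamoDB tuning audit, proof-of-concept replay, cross-cloud latency test); the `Measurements` fields and the returned labels are hypothetical names, while the 30%, 50ms, and 10ms thresholds come from the conditions themselves.

```python
from dataclasses import dataclass


@dataclass
class Measurements:
    dynamodb_savings_from_tuning: float  # fraction of cost, e.g. 0.35 for 35%
    poc_p99_ms: float                    # p99 under production-equivalent load
    cross_cloud_added_ms: float          # extra p99 latency from AWS to Azure


def triggered_reversals(m):
    """List which reversal conditions the measured values trigger."""
    reasons = []
    if m.dynamodb_savings_from_tuning >= 0.30:
        reasons.append("optimize_dynamodb_first")
    if m.poc_p99_ms > 50.0:
        reasons.append("coordinator_bottleneck")
    if m.cross_cloud_added_ms > 10.0:
        reasons.append("cross_cloud_latency")
    return reasons
```

An empty list means none of the stated conditions fired; it does not by itself confirm the verdict, since the cost figures remain unverified projections.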

Council notes

Vulcan
Propose a hybrid architecture: retain DynamoDB for read-heavy, non-relational workloads while introducing PostgreSQL ...
Socrates
Before considering migration, conduct a comprehensive database implementation audit of the current DynamoDB setup. Ma...
Daedalus
RECOMMENDATION: Migrate to PostgreSQL with Citus on Azure Managed Hyperscale (Citus), NOT self-managed Citus on EC2/R...
Loki
Azure Cosmos DB for PostgreSQL (Hyperscale Citus) recommendation ignores real-world coordinator bottlenecks: with 2,0...

Assumptions

  • DynamoDB cost (~$28K/month) is the primary driver for migration, not a misidentified implementation issue
  • The existing SaaS can tolerate a cross-cloud database dependency on Azure if other services remain on AWS
  • 90%+ of queries are single-tenant scoped (tenant_id filtered), making shard-local routing the dominant access pattern
  • The engineering team has sufficient PostgreSQL operational expertise to manage the migration and ongoing operations even with managed Citus
  • The 2-week dual-write cutover window is achievable given schema complexity and data volume across 2,000 tenants
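
The dual-write period with DynamoDB as read fallback can be pictured as a thin routing layer in the application. This is a minimal sketch under loud assumptions: plain dictionaries stand in for the Citus and DynamoDB clients, the class name `DualWriteRouter` is hypothetical, and the write order (fallback store first) is an illustrative choice rather than something the verdict prescribes.

```python
class DualWriteRouter:
    """Write to both stores; read from the new store, falling back to the old one."""

    def __init__(self, primary, fallback):
        self.primary = primary    # stand-in for the Citus cluster
        self.fallback = fallback  # stand-in for DynamoDB

    def write(self, key, value):
        # Assumption: the old store is written first so it stays complete
        # even if the write to the new store fails part-way through cutover.
        self.fallback[key] = value
        self.primary[key] = value

    def read(self, key):
        # Serve from the new store when the row exists there; otherwise
        # fall back to the old store and report which path was taken.
        try:
            return self.primary[key], "primary"
        except KeyError:
            return self.fallback[key], "fallback"
```

Counting how often reads take the `"fallback"` path gives a direct signal of how much data is still missing from Citus, which helps judge whether the 2-week window is realistic.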

Operational signals to watch

reversal — DynamoDB costs are primarily driven by implementation issues (poor partition key design, over-provisioned capacity) and a 30%+ cost reduction is achievable through optimization alone
reversal — Proof-of-concept shows coordinator bottleneck at 2,000 tenants causes p99 > 50ms under production-equivalent concurrent load
reversal — Existing infrastructure is entirely AWS-native and cross-cloud latency to Azure adds >10ms to p99, eating the safety margin

Unresolved uncertainty

  • Coordinator bottleneck at 2,000 tenants: killed branch b005 cited case studies from Framer and Heap showing coordinator hotspotting spiking p99 to 150ms+. This was auto-pruned as unsupported but the concern is architecturally valid and untested in this specific workload profile.
  • Cross-cloud migration complexity: if existing services are on AWS, moving the database to Azure introduces cross-cloud latency and data transfer costs not accounted for in the $4,200/month estimate.
  • The $4,200/month Azure cost and $28K/month DynamoDB cost are model-generated projections without cited production benchmarks for this specific workload volume.
  • No evidence that the current DynamoDB bottleneck has been formally diagnosed — b002/b007's concern that the problem may be implementation rather than technology remains valid.
  • Actual query patterns and data volume per tenant not specified — latency projections assume typical multi-tenant SaaS workloads.

Branch battle map

[Figure: branch battle map showing branches b001 to b007 across rounds R1 to R3, with a "Censor reopen" marker]
Battle timeline (3 rounds)
Round 1 — Initial positions · 3 branches
Branch b001 (Vulcan) eliminated — Branch b001 proposes a hybrid DynamoDB + PostgreSQL/Citus...
Socrates proposed branch b004
Socrates Reframe the problem: Instead of asking whether to migrate from DynamoDB to Postg…
Round 2 — Adversarial probes · 3 branches
Loki proposed branch b005
Branch b005 (Loki) eliminated — auto-pruned: unsupported low-confidence branch
Socrates proposed branch b006
Branch b006 (Socrates) eliminated — auto-pruned: unsupported low-confidence branch
Loki Azure Cosmos DB for PostgreSQL (Hyperscale Citus) recommendation ignores real-wo…
Socrates Instead of a simple yes/no migration decision, we should evaluate whether a hybr…
Round 3 — Final convergence · 3 branches
Branch b004 (Socrates) eliminated — Branch b004 proposes a 'polyglot persistence strategy' wi...
Socrates proposed branch b007
Socrates Before considering migration, conduct a comprehensive database implementation au…