Snowflake vs self-hosted ClickHouse for a 5TB/day analytics pipeline. Team of 3 data engineers, $15K/month budget, need sub-second dashboard queries.

accepted_conditional · Pro · 550s · $0.58
5 branches explored · 2 survived · 3 rounds · integrity 75%
Confidence: 82% · Risk: unknown


Decision

  1. Deploy self-hosted ClickHouse on 3x AWS i3.2xlarge instances ($4,200/month compute) with S3 tiered storage for 90+ day data (~$3,000/month at steady state), totaling $7,200-$9,000/month. Use the ReplicatedMergeTree engine with ClickHouse Keeper (not ZooKeeper). This stays well within the $15K budget, while Snowflake structurally cannot: at 5TB/day, Snowflake storage alone reaches $15K/month within 4-5 months, and compute credits push the total to $20K-$35K/month.
  2. ClickHouse on NVMe delivers sub-100ms on typical dashboard aggregations over billion-row tables with proper MergeTree ORDER BY keys, exceeding the sub-second requirement by 10x. Critical failure mode: operational burnout. A 3-person team will spend 20-30% of time on ops. If one engineer leaves, the remaining two face unsustainable burden. Mitigation: document runbooks aggressively in the first 60 days and keep ClickHouse Cloud as an escape hatch. Provision with Ansible/Terraform, enforce 90-day tiered storage policy to keep NVMe under 80% capacity.
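The budget arithmetic in point 1 can be sketched as a simple cumulative-storage model. This is a rough sketch, not a quote: the $23/TB-month rate is the list price cited in the uncertainty notes below, and it assumes the 5TB/day figure approximates compressed stored size with no data expiry.

```python
# Rough Snowflake storage-cost trajectory vs. the fixed ClickHouse floor.
# Assumptions (from this report, not a vendor quote): $23/TB-month compressed
# storage at list rate, 5 TB/day treated as stored size, no data expiry.

SNOWFLAKE_STORAGE_PER_TB_MONTH = 23.0   # USD, current list rate
INGEST_TB_PER_DAY = 5.0
CLICKHOUSE_FLOOR = 7_200.0              # USD/month: 3x i3.2xlarge + S3 tier (low end)
BUDGET_CEILING = 15_000.0               # USD/month

def snowflake_storage_bill(months_elapsed: float) -> float:
    """Monthly storage bill after `months_elapsed` of continuous ingestion."""
    stored_tb = INGEST_TB_PER_DAY * 30 * months_elapsed
    return stored_tb * SNOWFLAKE_STORAGE_PER_TB_MONTH

# Storage alone crosses the $15K ceiling between month 4 and month 5,
# before counting any compute credits.
for month in range(1, 7):
    bill = snowflake_storage_bill(month)
    print(f"month {month}: storage ${bill:,.0f}/mo "
          f"({'over' if bill > BUDGET_CEILING else 'under'} ceiling)")
```

Under these assumptions the bill is $13,800/month at month 4 and $17,250/month at month 5, which is where the "within 4-5 months" claim comes from; the ClickHouse floor, by contrast, does not grow with retained history once the tiering policy expires hot data.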

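The 90-day tiering and 80% NVMe threshold in point 2 amount to a small placement policy. In production ClickHouse enforces this itself via storage policies and table TTL clauses; the sketch below only illustrates the decision logic, with both thresholds taken from the plan above.

```python
# Data-placement policy from the plan: parts older than 90 days live on S3;
# anything on NVMe past 80% disk usage should also spill to S3 early rather
# than fill local disks. Illustrative only -- ClickHouse storage policies and
# TTL moves implement this natively.

HOT_RETENTION_DAYS = 90
NVME_USAGE_CEILING = 0.80

def target_volume(part_age_days: int, nvme_usage: float) -> str:
    """Decide where a data part belongs under the report's tiering policy."""
    if part_age_days >= HOT_RETENTION_DAYS:
        return "s3_cold"
    if nvme_usage >= NVME_USAGE_CEILING:
        return "s3_cold"       # spill early instead of breaching the 80% ceiling
    return "nvme_hot"

print(target_volume(30, 0.55))   # nvme_hot
print(target_volume(120, 0.55))  # s3_cold
print(target_volume(30, 0.85))   # s3_cold
```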
Next actions

Write Terraform + Ansible provisioning for 3x i3.2xlarge ClickHouse cluster with ClickHouse Keeper, NVMe storage configuration, and S3 tiered storage bucket with lifecycle policy
infra · immediate
Design ReplicatedMergeTree table schemas with ORDER BY keys aligned to top 5 dashboard query patterns; benchmark sub-second performance with representative 1TB sample data
data · immediate
Create operational runbooks for ClickHouse upgrades, disk capacity alerts (80% NVMe threshold), replication lag monitoring, and Keeper failure recovery within the first 60 days
infra · before_launch
Set up Grafana + Prometheus dashboards tracking NVMe usage, replication lag, query P99 latency, and monthly infrastructure cost against $15K ceiling
infra · before_launch
Define ClickHouse Cloud escape hatch trigger: if ops time exceeds 40% of team capacity for 2 consecutive sprints OR team drops to 2 or fewer engineers, initiate migration evaluation
data · before_launch
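The escape-hatch trigger defined in the last action item is mechanical enough to encode directly. A sketch, using the 40% threshold and two-sprint window from the action list and the 2-or-fewer headcount threshold from the reversal conditions below:

```python
# Escape-hatch trigger from the plan: start a ClickHouse Cloud migration
# evaluation if ops time exceeds 40% of team capacity for 2 consecutive
# sprints, OR the team shrinks to 2 or fewer engineers.

OPS_THRESHOLD = 0.40
CONSECUTIVE_SPRINTS = 2
MIN_TEAM_SIZE = 3   # trigger fires once headcount drops below this

def should_evaluate_migration(ops_fraction_by_sprint: list[float],
                              team_size: int) -> bool:
    """True when either escape-hatch condition from the plan holds."""
    if team_size < MIN_TEAM_SIZE:
        return True
    recent = ops_fraction_by_sprint[-CONSECUTIVE_SPRINTS:]
    return (len(recent) == CONSECUTIVE_SPRINTS
            and all(f > OPS_THRESHOLD for f in recent))

print(should_evaluate_migration([0.25, 0.45, 0.50], team_size=3))  # True
print(should_evaluate_migration([0.45, 0.30], team_size=3))        # False
print(should_evaluate_migration([0.10], team_size=2))              # True
```

Wiring this to the Grafana/Prometheus cost-and-ops dashboards makes the trigger a reviewed metric rather than a judgment call made under pressure.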
This verdict stops being true when
Budget increases to $30K+/month or Snowflake offers a negotiated enterprise rate below $10/TB compressed storage → Snowflake becomes viable — its managed service model eliminates the operational burnout risk that is ClickHouse's primary failure mode for a 3-person team
Team drops to 2 or fewer engineers, or ops burden exceeds 40% of team capacity for 2+ consecutive sprints → Migrate to ClickHouse Cloud managed service as the pre-planned escape hatch, trading higher cost for reduced operational burden
Ingestion rate drops below 1TB/day due to successful data volume reduction or business scope change → Re-evaluate Snowflake — at lower volumes, storage costs stay within budget and the managed service benefit outweighs ClickHouse operational overhead

Council notes

Socrates
**Split: First reduce data volume before choosing platform.** Before committing to either Snowflake or ClickHouse, co...
Vulcan
1) Choose Snowflake if operational simplicity matters more than cost predictability; 2) Choose self-hosted ClickHouse...
Daedalus
**Recommend: Self-hosted ClickHouse on a 3-node cluster with ReplicatedMergeTree engine and ClickHouse Keeper.** **A...
Loki
Self-hosted ClickHouse on 3 nodes for 5TB/day begs the question: with only 3 data engineers, who manages the inevitab...

Assumptions

  • The $15K/month budget is a hard ceiling that cannot be negotiated upward
  • 5TB/day ingestion rate is sustained and will not decrease significantly
  • The 3-person data engineering team has or can acquire sufficient ClickHouse operational skills within 60 days
  • Dashboard query patterns are predictable enough to align with MergeTree ORDER BY keys for sub-second performance
  • AWS i3.2xlarge instances remain available and priced at approximately $1,400/month each

Operational signals to watch

reversal — Budget increases to $30K+/month or Snowflake offers a negotiated enterprise rate below $10/TB compressed storage
reversal — Team drops to 2 or fewer engineers, or ops burden exceeds 40% of team capacity for 2+ consecutive sprints
reversal — Ingestion rate drops below 1TB/day due to successful data volume reduction or business scope change

Unresolved uncertainty

  • Whether the 3-person team has sufficient ClickHouse operational expertise — if all three are Snowflake/Redshift-experienced with zero ClickHouse exposure, the 60-day ramp-up estimate may be optimistic
  • Schema evolution failure mode was cut off in b003's output — the full severity and mitigation for ALTER TABLE blocking on 10B+ row ReplicatedMergeTree tables was not fully articulated
  • The $23/TB Snowflake compressed storage pricing is the current list rate; negotiated enterprise pricing could extend Snowflake's budget viability, though likely not enough to overcome the structural gap at 5TB/day
  • Query pattern diversity is unknown — sub-100ms assumes dashboard queries align with MergeTree ORDER BY keys. Ad-hoc exploratory queries on non-indexed dimensions may exceed sub-second SLA
  • b004's killed argument about Keeper failure during peak ingest causing hours of recovery was not fully rebutted — this is a real operational risk for a 3-person team

Branch battle map

Battle timeline (3 rounds)
Round 1 — Initial positions · 2 branches
Branch b002 (Vulcan) eliminated — Branch b002 is structurally empty — it provides no reco...
Round 2 — Adversarial probes · 3 branches
Loki proposed branch b004
Branch b004 (Loki) eliminated — auto-pruned: unsupported low-confidence branch
Socrates proposed branch b005
Loki Self-hosted ClickHouse on 3 nodes for 5TB/day begs the question: with only 3 dat…
Socrates **Split: First reduce data volume before choosing platform.** Before committing …
Round 3 — Final convergence · 2 branches
Branch b005 (Socrates) eliminated — This branch is structurally unsound because it treats a p...