Snowflake vs self-hosted ClickHouse for a 5TB/day analytics pipeline. Team of 3 data engineers, $15K/month budget, need sub-second dashboard queries.
Confidence: 82% · Risk: unknown · 550s
Decision
- Deploy self-hosted ClickHouse on 3x AWS i3.2xlarge instances ($4,200/month compute) with S3 tiered storage for data older than 90 days (~$3,000/month at steady state), totaling $7,200-$9,000/month. Use the ReplicatedMergeTree engine with ClickHouse Keeper (not ZooKeeper). This stays well within the $15K budget, while Snowflake structurally cannot: at 5TB/day, Snowflake storage alone reaches $15K/month within 4-5 months, and compute credits push the total to $20K-$35K/month soon after.
- ClickHouse on local NVMe delivers sub-100ms latency on typical dashboard aggregations over billion-row tables, given MergeTree ORDER BY keys aligned to the query patterns, beating the sub-second requirement by an order of magnitude.
- Critical failure mode: operational burnout. A 3-person team will spend 20-30% of its time on ops, and if one engineer leaves, the remaining two face an unsustainable burden. Mitigation: document runbooks aggressively in the first 60 days and keep ClickHouse Cloud as an escape hatch. Provision with Ansible/Terraform, and enforce the 90-day tiered-storage policy to keep NVMe below 80% capacity.
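The budget arithmetic behind the verdict can be sketched in a few lines. This is a rough model, not a quote: it assumes the 5TB/day figure is the billed (compressed) volume, the $23/TB-month list rate cited later in this report, full retention with no deletes, and the ~$1,400/instance and ~$3,000 S3 figures from the decision above.

```python
# Budget arithmetic behind the verdict (a sketch under the assumptions in the
# lead-in; all prices are the report's own estimates, not current quotes).
CLICKHOUSE_FIXED = 3 * 1_400 + 3_000   # 3x i3.2xlarge + steady-state S3 tier
SNOWFLAKE_PRICE_PER_TB_MONTH = 23      # list rate for compressed storage
INGEST_TB_PER_DAY = 5
BUDGET = 15_000

def snowflake_storage_bill(month: int) -> int:
    """Storage-only monthly bill after `month` months of full retention."""
    stored_tb = INGEST_TB_PER_DAY * 30 * month
    return stored_tb * SNOWFLAKE_PRICE_PER_TB_MONTH

# First month where Snowflake's storage bill alone exceeds the whole budget.
crossover = next(m for m in range(1, 37) if snowflake_storage_bill(m) > BUDGET)
print(CLICKHOUSE_FIXED)   # 7200 -- the low end of the $7,200-$9,000 estimate
print(crossover)          # 5 -- consistent with "within 4-5 months"
```

Note that the model ignores Snowflake compute credits entirely; adding them only widens the gap, which is why the verdict calls the difference structural rather than negotiable.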
Next actions
- Write Terraform + Ansible provisioning for a 3x i3.2xlarge ClickHouse cluster with ClickHouse Keeper, NVMe storage configuration, and an S3 tiered-storage bucket with a lifecycle policy
- Design ReplicatedMergeTree table schemas with ORDER BY keys aligned to the top 5 dashboard query patterns; benchmark sub-second performance against a representative 1TB data sample
- Create operational runbooks for ClickHouse upgrades, disk-capacity alerts (80% NVMe threshold), replication-lag monitoring, and Keeper failure recovery within the first 60 days
- Set up Grafana + Prometheus dashboards tracking NVMe usage, replication lag, query P99 latency, and monthly infrastructure cost against the $15K ceiling
- Define the ClickHouse Cloud escape-hatch trigger: if ops time exceeds 40% of team capacity for 2 consecutive sprints OR the team drops to 2 or fewer engineers, initiate a migration evaluation
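The escape-hatch trigger in the last action item is mechanical enough to encode, which makes it harder to rationalize away mid-quarter. A minimal sketch, assuming ops time is tracked as a fraction of team capacity per sprint (the function name and inputs are hypothetical; the 40%, 2-sprint, and 2-engineer thresholds come from the text):

```python
def should_evaluate_migration(ops_fraction_by_sprint, engineer_count):
    """Escape-hatch trigger: ops time above 40% of team capacity for 2
    consecutive sprints, OR headcount at 2 engineers or fewer.
    (Hypothetical helper mirroring the trigger defined above.)"""
    two_hot_sprints = any(
        a > 0.40 and b > 0.40
        for a, b in zip(ops_fraction_by_sprint, ops_fraction_by_sprint[1:])
    )
    return two_hot_sprints or engineer_count <= 2

print(should_evaluate_migration([0.30, 0.45, 0.50], engineer_count=3))  # True
print(should_evaluate_migration([0.45, 0.30, 0.45], engineer_count=3))  # False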
This verdict stops being true when
- Budget increases to $30K+/month, or Snowflake offers a negotiated enterprise rate below $10/TB compressed storage → Snowflake becomes viable: its managed service model eliminates the operational-burnout risk that is ClickHouse's primary failure mode for a 3-person team
- Team drops to 2 or fewer engineers, or ops burden exceeds 40% of team capacity for 2+ consecutive sprints → Migrate to ClickHouse Cloud managed service as the pre-planned escape hatch, trading higher cost for reduced operational burden
- Ingestion rate drops below 1TB/day due to successful data-volume reduction or a business scope change → Re-evaluate Snowflake: at lower volumes, storage costs stay within budget and the managed-service benefit outweighs ClickHouse's operational overhead
Council notes
Council members: Socrates, Vulcan, Daedalus, Loki
Assumptions
- The $15K/month budget is a hard ceiling that cannot be negotiated upward
- 5TB/day ingestion rate is sustained and will not decrease significantly
- The 3-person data engineering team has or can acquire sufficient ClickHouse operational skills within 60 days
- Dashboard query patterns are predictable enough to align with MergeTree ORDER BY keys for sub-second performance
- AWS i3.2xlarge instances remain available and priced at approximately $1,400/month each
Operational signals to watch
- Reversal: Budget increases to $30K+/month, or Snowflake offers a negotiated enterprise rate below $10/TB compressed storage
- Reversal: Team drops to 2 or fewer engineers, or ops burden exceeds 40% of team capacity for 2+ consecutive sprints
- Reversal: Ingestion rate drops below 1TB/day due to successful data-volume reduction or a business scope change
Unresolved uncertainty
- Whether the 3-person team has sufficient ClickHouse operational expertise — if all three are Snowflake/Redshift-experienced with zero ClickHouse exposure, the 60-day ramp-up estimate may be optimistic
- Schema evolution failure mode was cut off in b003's output — the full severity and mitigation for ALTER TABLE blocking on 10B+ row ReplicatedMergeTree tables was not fully articulated
- The $23/TB Snowflake compressed storage pricing is the current list rate; negotiated enterprise pricing could extend Snowflake's budget viability, though likely not enough to overcome the structural gap at 5TB/day
- Query pattern diversity is unknown — sub-100ms assumes dashboard queries align with MergeTree ORDER BY keys. Ad-hoc exploratory queries on non-indexed dimensions may exceed sub-second SLA
- b004's killed argument about Keeper failure during peak ingest causing hours of recovery was not fully rebutted — this is a real operational risk for a 3-person team
Branch battle map
Battle timeline (3 rounds)
Round 1 — Initial positions · 2 branches
- Branch b002 (Vulcan) eliminated — Branch b002 is structurally empty — it provides no reco...
Round 2 — Adversarial probes · 3 branches
- Loki proposed branch b004
- Branch b004 (Loki) eliminated — auto-pruned: unsupported low-confidence branch
- Socrates proposed branch b005
- Loki: Self-hosted ClickHouse on 3 nodes for 5TB/day begs the question: with only 3 dat…
- Socrates: **Split: First reduce data volume before choosing platform.** Before committing …
Round 3 — Final convergence · 2 branches
- Branch b005 (Socrates) eliminated — This branch is structurally unsound because it treats a p...