Should we replace Redis with Valkey now that Redis changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?
This verdict assumes 50% of constraints
Constraint slots are tagged by provenance so synthetic defaults do not look like observed facts:
- [synthetic] team size: standard team (5-10 engineers); synthetic default, not observed (not_addressed)
- [synthetic] existing stack: greenfield assumed; synthetic default, not observed (not_addressed)
- [not_provided] connection pooler: not specified (not_addressed)
- [not_provided] current state: not specified (not_addressed)
- [not_provided] rollback plan: not specified (not_addressed)
- [not_provided] data volume: not specified (not_addressed)
Execute a phased canary migration from Redis to Valkey 7.2.x using a dual-write proxy pattern over 4 months
Decision
Execute a phased canary migration from Redis to Valkey 7.2.x over 4 months using a dual-write proxy pattern (Envoy with the redis_proxy filter, or Twemproxy).
- Phase 1: Stand up a 20-node Valkey canary (10% of fleet) that receives shadow writes while Redis serves all reads.
- Phase 2: Shift reads for the session cache workload to the Valkey canary, validating p99 ≤2ms and cache hit ratio ≥85%.
- Phase 3: Expand to 100 Valkey nodes at 50% traffic.
- Phase 4: Full 200-node cutover, with Redis kept warm for a 2-week rollback window.
Abort if any of the following occurs: Valkey p99 exceeds 3ms, cluster gossip exceeds 100 Mbps, pub/sub latency exceeds 5ms, or more than 2 nodes fail in any 7-day canary window.
Key failure mode: pub/sub in a 200-node cluster broadcasts to all cluster members, so if real-time events exceed 100K messages/sec, internal bandwidth saturates. Mitigation: isolate pub/sub onto a dedicated 16-node cluster.
Second failure mode: cluster rebalancing storms across the 16,384 hash slots during node topology changes.
Budget: $50K total. This avoids the $400K-$600K/year Redis Enterprise licensing cost and the security risk of staying on Redis 7.2 (the last BSD-3-Clause release) as patches shift to 7.4+.
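The four abort conditions in the decision can be read as a simple predicate over canary metrics. The following is a minimal sketch, assuming the metrics have already been aggregated into the four values the decision names. The `CanaryMetrics` type, its field names, and `abort_reasons` are hypothetical names introduced for illustration and do not belong to any monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    """Hypothetical snapshot of the four signals named in the decision."""
    p99_latency_ms: float      # Valkey p99 command latency
    gossip_mbps: float         # cluster gossip bandwidth
    pubsub_latency_ms: float   # pub/sub delivery latency
    node_failures_7d: int      # node failures in the trailing 7-day window


def abort_reasons(m: CanaryMetrics) -> list[str]:
    """Return every abort threshold the snapshot violates (empty list = continue)."""
    reasons = []
    if m.p99_latency_ms > 3.0:
        reasons.append("p99 latency exceeds 3 ms")
    if m.gossip_mbps > 100.0:
        reasons.append("cluster gossip exceeds 100 Mbps")
    if m.pubsub_latency_ms > 5.0:
        reasons.append("pub/sub latency exceeds 5 ms")
    if m.node_failures_7d > 2:
        reasons.append("more than 2 node failures in 7 days")
    return reasons
```

Returning the list of violated thresholds, rather than a single boolean, keeps the reason for an abort visible in whatever alert or log consumes it.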
Inferred specifics
| Value | Kind | Basis | Where introduced |
|---|---|---|---|
| Valkey 7.2.x | version | synthetic | chosen_path |
| Redis 7.2 | version | synthetic | chosen_path |
| Redis 7.4+ | version | synthetic | chosen_path |
| 4-month migration using a dual-write proxy | estimate | synthetic | chosen_path |
| Phase 1: 20-node Valkey canary | estimate | synthetic | chosen_path |
| 10% of fleet | threshold | synthetic | chosen_path |
| 2 | estimate | synthetic | chosen_path |
| p99 ≤2ms and cache hit ratio ≥85% | threshold | synthetic | chosen_path |
| Phase 3: expand to 100 Valkey nodes | estimate | synthetic | chosen_path |
| 50% traffic | threshold | synthetic | chosen_path |
| Abort if Valkey p99 exceeds 3ms | threshold | synthetic | chosen_path |
| Abort if pub/sub latency exceeds 5ms | threshold | synthetic | chosen_path |
| Real-time events exceed 100K messages/sec | estimate | synthetic | chosen_path |
| Mitigation: isolate pub/sub onto a dedicated 16-node cluster | estimate | synthetic | chosen_path |
| 16,384 hash slots during node topology changes | estimate | synthetic | chosen_path |
| Budget: $50K total | estimate | synthetic | chosen_path |
| Avoids the $400K-$600K/year Redis Enterprise licensing cost | technology | synthetic | chosen_path |
| Redis 7.2 as the last BSD-3-Clause version | estimate | synthetic | chosen_path |
Highest-probability failure mode: not computed - insufficient evidence in filing to identify with confidence.
Next actions
Verdict-to-Work
Council notes
Evidence boundary
Observed from your filing
- should we replace Redis with Valkey now that Redis changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?
Assumptions used for analysis
- Valkey 7.2.x is API-compatible with the Redis commands and data structures currently used across the 200-node deployment, which uses no custom Redis modules or RESP3-specific features that Valkey does not support
- The existing deployment runs Redis 7.2 or earlier (7.2 being the last BSD-3-Clause release) and has not yet upgraded to Redis 7.4+ under the new RSALv2/SSPLv1 dual license
- Cloud infrastructure can provision 20 additional nodes for canary without exceeding quota or budget approval timelines
- The 2M ops/sec workload is distributed across session cache, rate limiting, and pub/sub — not a single monolithic use case that cannot be decomposed for phased migration
- Envoy with redis_proxy filter can handle the dual-write throughput at the required proxy layer without becoming a bottleneck itself
- team size synthetic default (not observed): standard team (5-10 engineers) [synthetic] (not_addressed)
- existing stack synthetic default (not observed): greenfield assumed [synthetic] (not_addressed)
- connection pooler not provided: not specified [not_provided] (not_addressed)
- current state not provided: not specified [not_provided] (not_addressed)
- rollback plan not provided: not specified [not_provided] (not_addressed)
- data volume not provided: not specified [not_provided] (not_addressed)
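The dual-write assumption is easier to reason about with the semantics spelled out. The sketch below is an application-level illustration only: the verdict places this logic in a proxy (Envoy's redis_proxy filter), not in client code. `DualWriteClient` and `InMemoryStore` are hypothetical stand-ins that model only `set` and `get`. The primary store is authoritative, the shadow store receives a best-effort copy of each write, and shadow failures are counted so they never reach the caller.

```python
class InMemoryStore:
    """Hypothetical stand-in for a Redis or Valkey client (set/get only)."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
        return True

    def get(self, key):
        return self.data.get(key)


class DualWriteClient:
    """Writes go to both stores; reads are served by the primary only."""

    def __init__(self, primary, shadow):
        self.primary = primary        # authoritative store (Redis in Phase 1)
        self.shadow = shadow          # canary store (Valkey in Phase 1)
        self.shadow_errors = 0        # count of shadow writes that failed

    def set(self, key, value):
        result = self.primary.set(key, value)   # a primary failure propagates
        try:
            self.shadow.set(key, value)         # best effort, never blocks the caller
        except Exception:
            self.shadow_errors += 1             # divergence signal to alert on
        return result

    def get(self, key):
        return self.primary.get(key)


redis_like, valkey_like = InMemoryStore(), InMemoryStore()
client = DualWriteClient(redis_like, valkey_like)
client.set("session:42", "alice")
```

Each increment of `shadow_errors` marks a key where the two stores may have diverged. This pattern does not repair that divergence on its own.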
Inferred candidate specifics
- Deploy a 20-node Valkey 7.2.6 canary cluster in the same availability zone as the existing Redis deployment, configure Envoy with redis_proxy filter for dual-write from 10% of the production write path, and instrument Prometheus/Grafana dashboards tracking p99 latency, gossip bandwidth, pub/sub delivery latency, and node failure rate against the four abort thresholds.
- b003 had the highest confidence (0.90) among surviving branches, survived 3 rounds of adversarial challenge including a direct attack on dual-write feasibility (b004, killed), and provided the most concrete architecture: named proxy technology (Envoy redis_proxy), specific phase timeline, quantified abort thresholds, named failure modes with mitigations, and a budget breakdown. b002 (0.70) was a strictly weaker version of the same recommendation without the specificity.
- Hybrid architecture with Valkey at edge and commercial caching (ElastiCache) for critical workloads
- Architecturally incoherent: ElastiCache is itself a managed Redis/Valkey service. Introduced cache coherence problems at 2M ops/sec without naming a consistency protocol. Claimed a p99 of 1.5ms while adding a synchronization layer, which cannot lower latency. Fabricated budget constraints.
- Treat as a legal/contractual issue, negotiate commercial Redis license before any migration
- SSPL/RSAL is a blanket license change, not negotiable per-customer. Redis Enterprise for 200 nodes would cost $400K-$600K/year vs. $50K one-time migration. Backup options (KeyDB unmaintained since 2022, DragonflyDB uses BSL 1.1) have the same or worse license problems. Delay accumulates unpatched CVE exposure on Redis 7.2.
- Reject dual-write as introducing insurmountable consistency risks and >10ms p99 spikes
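The cost argument used against the licensing alternative is simple arithmetic, worked here for reference. This is a rough sketch that uses only the report's own synthetic figures ($50K one-time migration, $400K-$600K/year licensing avoided). It ignores ongoing Valkey infrastructure cost, engineering time, and the 4-month migration window. `breakeven_months` is a hypothetical helper name.

```python
def breakeven_months(one_time_cost: float, annual_cost_avoided: float) -> float:
    """Months of avoided annual spend needed to repay a one-time cost."""
    return 12 * one_time_cost / annual_cost_avoided


MIGRATION_COST = 50_000          # report's estimate, one-time
LICENSE_LOW = 400_000            # report's estimate, per year
LICENSE_HIGH = 600_000           # report's estimate, per year

slowest = breakeven_months(MIGRATION_COST, LICENSE_LOW)    # 1.5 months
fastest = breakeven_months(MIGRATION_COST, LICENSE_HIGH)   # 1.0 months
```

Under these figures the migration pays for itself in one to one and a half months of avoided licensing. The conclusion depends entirely on the two estimates, and the report itself lists the $50K figure as an unknown.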
Unknowns blocking a firmer verdict
- Valkey 7.2.x cluster behavior at exactly 200 nodes is not widely benchmarked in public literature — the gossip bandwidth and rebalancing storm thresholds are engineering estimates, not production-validated numbers at this specific scale
- b003's budget of $50K is a rough estimate — actual costs depend heavily on cloud provider, instance types, and whether reserved/spot pricing is available for the canary phase
- The pub/sub 100K messages/sec threshold for bandwidth saturation is model-derived, not benchmarked against Valkey's specific cluster broadcast implementation
- Redis 7.2 security patch timeline is uncertain — Redis Ltd may continue critical CVE patches longer than expected, or may not
- b004 (killed) raised a valid concern about dual-write consistency during network partitions that b003 addresses only via abort thresholds, not via a formal consistency protocol
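Because the rebalancing concern rests on the 16,384 hash slots, it helps to see how keys map to slots. The sketch below follows the key-to-slot rule of the Redis Cluster specification, which Valkey inherits: CRC16 (XMODEM variant) of the key modulo 16384, with hash-tag handling. It uses Python's `binascii.crc_hqx`, which computes that CRC variant when seeded with 0. `slots_for_new_nodes` is a hypothetical helper that assumes slots are spread evenly across nodes. It gives the number of slots that must migrate to newly added nodes, which is a lower bound on slot movement during an expansion.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key: bytes) -> int:
    """Map a key to its cluster hash slot (CRC16-XMODEM mod 16384)."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Hash only the tag when "{...}" encloses at least one character.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


def slots_for_new_nodes(old_nodes: int, new_nodes: int) -> int:
    """Slots that end up on added nodes, assuming an even spread (hypothetical helper)."""
    return HASH_SLOTS * (new_nodes - old_nodes) // new_nodes


same_slot = key_slot(b"{user1000}.following") == key_slot(b"{user1000}.followers")
phase4_moves = slots_for_new_nodes(100, 200)   # Phase 3 -> Phase 4 expansion
```

Under the even-spread assumption, doubling from 100 to 200 nodes moves half of all slots. This is one reason topology changes at this scale are usually staged, not applied in a single step.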
Operational signals to watch
Branch battle map
Battle timeline (3 rounds)
Evidence source proof
Evidence source proof is not available for legacy verdicts dated before 2026-05-20.