Execute a phased canary migration from Redis to Valkey 7.2.x using a dual-write proxy pattern over 4 months
Decision: 82%
Execution: -
Uncertainty: -
This verdict was re-examined after censor review.
Decision
Execute a phased canary migration from Redis to Valkey 7.2.x over 4 months using a dual-write proxy pattern (Envoy with redis_proxy filter or Twemproxy).
Phase 1: Stand up a 20-node Valkey canary (10% of fleet) receiving shadow writes while Redis serves all reads.
Phase 2: Shift reads for the session cache workload to the Valkey canary, validating p99 ≤2ms and cache hit ratio ≥85%.
Phase 3: Expand to 100 Valkey nodes at 50% traffic.
Phase 4: Full 200-node cutover with Redis kept warm for a 2-week rollback window.
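The dual-write behaviour described for Phase 1 can be sketched in a few lines. This is an illustration only, not the proxy configuration: `InMemoryStore` and `DualWriteRouter` are hypothetical names, and the stores stand in for real Redis and Valkey clients. The sketch assumes that a shadow-write failure must never fail the caller's request and is only counted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Minimal interface assumed for both backends (hypothetical)."""

    def set(self, key: str, value: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...


@dataclass
class InMemoryStore:
    """Stand-in for a Redis or Valkey client, used only for illustration."""

    data: Dict[str, bytes] = field(default_factory=dict)
    fail_writes: bool = False

    def set(self, key: str, value: bytes) -> None:
        if self.fail_writes:
            raise ConnectionError("store unavailable")
        self.data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self.data.get(key)


@dataclass
class DualWriteRouter:
    primary: KeyValueStore  # Redis: source of truth during Phase 1
    shadow: KeyValueStore   # Valkey canary: receives shadow writes only
    shadow_errors: int = 0

    def set(self, key: str, value: bytes) -> None:
        # A primary failure propagates to the caller as usual.
        self.primary.set(key, value)
        try:
            self.shadow.set(key, value)
        except Exception:
            # Shadow failures are counted, never surfaced to the caller.
            self.shadow_errors += 1

    def get(self, key: str) -> Optional[bytes]:
        # Phase 1: every read is served by the primary.
        return self.primary.get(key)
```

Writing to the primary first means the shadow can lag or miss writes but never holds data the primary lacks, which is the property a later read shift depends on. It does not address ordering or partition behaviour.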
Abort if: Valkey p99 exceeds 3ms, cluster gossip exceeds 100 Mbps, pub/sub latency exceeds 5ms, or more than 2 node failures in any 7-day canary window.
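The abort rule above is mechanical enough to encode directly. The following is a minimal sketch, assuming the four metrics are already collected per 7-day canary window. `CanaryMetrics` and `abort_reasons` are hypothetical names, and "exceeds" is read as a strict comparison.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryMetrics:
    p99_ms: float             # Valkey p99 command latency
    gossip_mbps: float        # cluster gossip bandwidth
    pubsub_latency_ms: float  # pub/sub delivery latency
    node_failures_7d: int     # node failures in the current 7-day window


def abort_reasons(m: CanaryMetrics) -> List[str]:
    """Return every abort threshold that the given metrics breach."""
    reasons = []
    if m.p99_ms > 3.0:
        reasons.append(f"p99 {m.p99_ms}ms exceeds 3ms")
    if m.gossip_mbps > 100.0:
        reasons.append(f"gossip {m.gossip_mbps} Mbps exceeds 100 Mbps")
    if m.pubsub_latency_ms > 5.0:
        reasons.append(f"pub/sub latency {m.pubsub_latency_ms}ms exceeds 5ms")
    if m.node_failures_7d > 2:
        reasons.append(f"{m.node_failures_7d} node failures in 7 days exceeds 2")
    return reasons
```

An empty list means the canary may continue. Any non-empty list is an abort signal, and the strings record which threshold tripped.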
Key failure mode: in cluster mode, every classic PUBLISH is forwarded to all cluster members, so at 200 nodes, real-time event rates above 100K messages/sec can saturate internal bandwidth. Mitigation: isolate pub/sub onto a dedicated 16-node cluster. Second failure mode: rebalancing storms as the 16,384 hash slots are migrated between nodes during topology changes.
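For context on the 16,384 hash slots: Redis Cluster, and Valkey as its fork, assigns each key to slot CRC16(key) mod 16384, hashing only the hash-tag portion when the key contains a non-empty `{...}` section. A minimal sketch follows. `key_slot` and `slots_per_primary` are hypothetical helper names, and the even-distribution helper assumes every node is a primary, which the plan does not state.

```python
import binascii
from typing import Tuple

HASH_SLOTS = 16384


def key_slot(key: bytes) -> int:
    """Compute the cluster hash slot for a key, honouring hash tags."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # binascii.crc_hqx with initial value 0 is the CRC16 (XMODEM) variant
    # named in the cluster specification.
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


def slots_per_primary(primaries: int) -> Tuple[int, int]:
    """Return (base, extra): each primary owns `base` slots, `extra` own one more."""
    return divmod(HASH_SLOTS, primaries)
```

With 200 primaries each node owns only 81 or 82 slots, so any topology change moves many small slot ranges. That is one way to picture the rebalancing concern, though how severe it is in practice remains an estimate.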
Budget: $50K total. This avoids the $400K-$600K/year Redis Enterprise licensing cost and the security risk of staying on Redis 7.2 (the last BSD-3-Clause-licensed release line) as patches shift to 7.4+.
Next actions
Deploy 20-node Valkey 7.2.6 canary cluster with identical configuration to current Redis nodes (maxmemory policy, persistence settings, cluster-enabled yes)
infra · immediate
Configure Envoy proxy with redis_proxy filter for dual-write, routing 10% of write traffic to Valkey canary while Redis continues serving 100% of reads
infra · immediate
Build Prometheus/Grafana dashboards tracking the four abort thresholds: p99 >3ms, gossip >100 Mbps, pub/sub >5ms, >2 node failures/7 days, plus cache hit ratio ≥85%
infra · immediate
Run canary for 2 weeks under shadow write load, comparing Valkey p99/p999 latency distributions against Redis baseline at equivalent traffic volume
backend · immediate
Benchmark Valkey pub/sub message throughput on the 20-node canary to validate the 100K messages/sec threshold before Phase 2 read shifting
backend · before_launch
At end of Phase 1 (Month 2), evaluate canary metrics against abort thresholds and decide whether to proceed to Phase 2 read shifting or abort migration
infra · before_launch
This verdict stops being true when
Valkey canary fails abort thresholds during Phase 1 (p99 >3ms sustained, gossip >100 Mbps, or >2 node failures) and root cause is a fundamental Valkey architectural limitation rather than configuration → Negotiate Redis Enterprise commercial license despite the $400K-$600K/year cost, or evaluate DragonflyDB if BSL 1.1 is legally acceptable for the organization
Redis Ltd reverses the SSPL/RSAL license change or creates a permissive-use exemption for self-hosted non-competing deployments → Stay on Redis, upgrade to latest version, cancel migration
Pub/sub workload exceeds 100K messages/sec and cannot be isolated onto a dedicated cluster due to application coupling, causing cluster bandwidth saturation at scale → Migrate session/rate-limiting to Valkey but move pub/sub workload to a dedicated message broker (Kafka, NATS) rather than running it on Valkey cluster
Council notes
Socrates
RECOMMENDATION: Treat the Redis license change as a legal/contractual issue, not a technical one. Before committing t...
Vulcan
Explore the technical and operational feasibility of migrating the 200-node Redis deployment to Valkey, focusing on m...
Daedalus
RECOMMENDATION: Execute a phased canary migration from Redis to Valkey 7.2.x using a dual-write proxy pattern over 4 ...
Loki
A dual-write proxy pattern on 200 nodes at 2M ops/sec introduces inevitable consistency risks from out-of-order deliv...
Assumptions
Valkey 7.2.x is API-compatible with the Redis commands and data structures currently used across the 200-node deployment, which relies on no custom Redis modules or RESP3-specific features that the Valkey fork lacks
The existing deployment runs Redis 7.2 or earlier (the last BSD-3-Clause-licensed release line) and has not yet upgraded to Redis 7.4+ under the new SSPL/RSAL license
Cloud infrastructure can provision 20 additional nodes for canary without exceeding quota or budget approval timelines
The 2M ops/sec workload is distributed across session cache, rate limiting, and pub/sub — not a single monolithic use case that cannot be decomposed for phased migration
Envoy with redis_proxy filter can handle the dual-write throughput at the required proxy layer without becoming a bottleneck itself
Operational signals to watch
reversal — Valkey canary fails abort thresholds during Phase 1 (p99 >3ms sustained, gossip >100 Mbps, or >2 node failures) and root cause is a fundamental Valkey architectural limitation rather than configuration
reversal — Redis Ltd reverses the SSPL/RSAL license change or creates a permissive-use exemption for self-hosted non-competing deployments
reversal — Pub/sub workload exceeds 100K messages/sec and cannot be isolated onto a dedicated cluster due to application coupling, causing cluster bandwidth saturation at scale
Unresolved uncertainty
Valkey 7.2.x cluster behavior at exactly 200 nodes is not widely benchmarked in public literature — the gossip bandwidth and rebalancing storm thresholds are engineering estimates, not production-validated numbers at this specific scale
b003's budget of $50K is a rough estimate — actual costs depend heavily on cloud provider, instance types, and whether reserved/spot pricing is available for the canary phase
The pub/sub 100K messages/sec threshold for bandwidth saturation is model-derived, not benchmarked against Valkey's specific cluster broadcast implementation
Redis 7.2 security patch timeline is uncertain — Redis Ltd may continue critical CVE patches longer than expected, or may not
b004 (killed) raised a valid concern about dual-write consistency during network partitions that b003 addresses only via abort thresholds, not via a formal consistency protocol
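Since the 100K messages/sec pub/sub threshold above is described as model-derived, a back-of-envelope version of such a model may help show where the number is sensitive. The sketch assumes each published message is forwarded once to every other node and ignores protocol framing unless `overhead_bytes` is set. The function name and the 100-byte payload used in the note below are illustrative assumptions, not figures from the deployment.

```python
def pubsub_fanout_mbps(
    messages_per_sec: int,
    payload_bytes: int,
    cluster_nodes: int,
    overhead_bytes: int = 0,
) -> float:
    """Aggregate inter-node traffic in Mbps if every message reaches all other nodes."""
    copies = cluster_nodes - 1  # the publishing node forwards to every peer
    bits_per_sec = messages_per_sec * copies * (payload_bytes + overhead_bytes) * 8
    return bits_per_sec / 1_000_000
```

Under these assumptions, 100K messages/sec of 100-byte payloads yields about 15,920 Mbps of aggregate fan-out on 200 nodes versus 1,200 Mbps on a dedicated 16-node cluster. The result scales linearly with payload size, which is why the threshold should be benchmarked rather than trusted.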
Branch battle map
Battle timeline (3 rounds)
Round 1 — Initial positions · 2 branches
Branch b001 (Socrates) eliminated — This branch has fundamental structural problems that make...