should we replace Redis with Valkey now that Redis changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?

accepted_conditional · Release validation · software_operational · Pro · 629s · $0.77

This verdict assumes 50% of constraints

Constraint slots are tagged by provenance so synthetic defaults do not look like observed facts:

synthetic team size synthetic default (not observed): standard team (5-10 engineers) (not_addressed)
synthetic existing stack synthetic default (not observed): greenfield assumed (not_addressed)
not_provided connection pooler not provided: not specified (not_addressed)
not_provided current state not provided: not specified (not_addressed)
not_provided rollback plan not provided: not specified (not_addressed)
not_provided data volume not provided: not specified (not_addressed)

5 branches explored · 2 survived · 3 rounds · integrity 75%

Candidate estimate (inferred)

Risk unknown 629s

Verdict

Execute a phased canary migration from Redis to Valkey 7.2.x using a dual-write proxy pattern over 4 months

Decision

82%

Execution

—

Uncertainty

—

Decision

Concrete components, topology, and thresholds named below are candidate mitigations or example implementations inferred by the Council. They were not confirmed in your filing or established as part of your current environment.

Execute a phased canary migration from Redis to Valkey 7.2.x over 4 months using a dual-write proxy pattern (Envoy with redis_proxy filter or Twemproxy).

Phase 1: Stand up a 20-node Valkey canary (10% of fleet) receiving shadow writes while Redis serves all reads.
Phase 2: Shift reads for the session cache workload to the Valkey canary, validating p99 ≤2ms and cache hit ratio ≥85%.
Phase 3: Expand to 100 Valkey nodes at 50% traffic.
Phase 4: Full 200-node cutover with Redis kept warm for a 2-week rollback window.

Abort if: Valkey p99 exceeds 3ms, cluster gossip exceeds 100 Mbps, pub/sub latency exceeds 5ms, or more than 2 nodes fail in any 7-day canary window.

Key failure mode: pub/sub at 200 nodes broadcasts to all cluster members; if real-time events exceed 100K messages/sec, internal bandwidth saturates. Mitigation: isolate pub/sub onto a dedicated 16-node cluster.
Second failure mode: cluster rebalancing storms across the 16,384 hash slots during node topology changes.

Budget: $50K total. This avoids the $400K-$600K/year Redis Enterprise licensing cost and the security risk of staying on Redis 7.2 (the last BSD-3-Clause version) as patches shift to 7.4+.
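
The second failure mode above turns on how keys are spread over the 16,384 hash slots. As a minimal sketch, assuming the standard cluster key-to-slot rule (CRC16/XMODEM of the key, modulo 16384, with `{...}` hash tags) that Valkey 7.2 inherits from Redis, the mapping can be reproduced in a few lines. The function names `key_hash_slot` and `slots_per_node` are illustrative, and treating all 200 nodes as slot-owning primaries is an assumption, since the filing does not say how many are replicas.

```python
import binascii

HASH_SLOTS = 16_384  # fixed slot count named in the verdict


def key_hash_slot(key: bytes) -> int:
    """Map a key to its cluster hash slot (CRC16/XMODEM mod 16384)."""
    # Hash tags: if the key contains "{...}" with a non-empty body,
    # only that body is hashed, so related keys land in the same slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the XMODEM CRC16 variant.
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


def slots_per_node(primaries: int) -> float:
    """Average number of slots each primary owns (assumes an even split)."""
    return HASH_SLOTS / primaries


if __name__ == "__main__":
    print(key_hash_slot(b"{user1000}.following") == key_hash_slot(b"{user1000}.followers"))
    print(slots_per_node(200))  # assumption: all 200 nodes are primaries
```

With roughly 82 slots per primary at 200 primaries, any node joining or leaving moves a small number of slots per node but touches many nodes, which is one way to read the "rebalancing storm" concern.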

Inferred specifics

Structured audit rows for Council-added details. Synthetic basis means the detail was introduced by analysis, not supplied by the filing.

Value	Kind	Basis	Where introduced
Valkey 7.2	version	synthetic	chosen_path
Redis 7.2	version	synthetic	chosen_path
to 7.4+	version	synthetic	chosen_path
over 4 months using a dual-write proxy pattern	estimate	synthetic	chosen_path
Phase 1: Stand up a 20-node Valkey canary	estimate	synthetic	chosen_path
10% of fleet	threshold	synthetic	chosen_path
2	estimate	synthetic	chosen_path
validating p99 ≤2ms and cache hit ratio ≥85%	threshold	synthetic	chosen_path
Phase 3: Expand to 100 Valkey	estimate	synthetic	chosen_path
Expand to 100 Valkey nodes at 50%	threshold	synthetic	chosen_path
nodes at 50% traffic	threshold	synthetic	chosen_path
Abort if: Valkey p99 exceeds 3ms	threshold	synthetic	chosen_path
pub/sub latency exceeds 5ms	threshold	synthetic	chosen_path
to Valkey 7	estimate	synthetic	chosen_path
events exceed 100K messages/sec	estimate	synthetic	chosen_path
Mitigation: isolate pub/sub onto a dedicated 16-node cluster	estimate	synthetic	chosen_path
16,384 hash slots during node topology changes	estimate	synthetic	chosen_path
Budget: $50K total	estimate	synthetic	chosen_path
avoids the $400K-$600K/year Redis Enterprise licensing cost	technology	synthetic	chosen_path
last BSD-3-Clause version	estimate	synthetic	chosen_path

Highest-probability failure mode: not computed - insufficient evidence in filing to identify with confidence.

Next actions

Candidate estimate (inferred, not source-confirmed): Deploy 20-node Valkey 7.2.6 canary cluster with identical configuration to current Redis nodes (maxmemory policy, persistence settings, cluster-enabled yes)

infra · immediate
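
"Identical configuration" in the action above can be checked mechanically. The following is a minimal sketch, assuming the settings of each node have already been collected into plain dictionaries (for example from `CONFIG GET` output or parsed config files). The key list `PARITY_KEYS` and the function `config_drift` are hypothetical names chosen for illustration, not part of any Redis or Valkey tooling.

```python
# Settings the action calls out, plus maxmemory itself (an assumption).
PARITY_KEYS = (
    "maxmemory",
    "maxmemory-policy",
    "appendonly",
    "save",
    "cluster-enabled",
)


def config_drift(redis_cfg: dict, valkey_cfg: dict, keys=PARITY_KEYS) -> dict:
    """Return {setting: (redis_value, valkey_value)} for every mismatch.

    A setting missing on one side is reported with None for that side.
    """
    drift = {}
    for key in keys:
        redis_value = redis_cfg.get(key)
        valkey_value = valkey_cfg.get(key)
        if redis_value != valkey_value:
            drift[key] = (redis_value, valkey_value)
    return drift


if __name__ == "__main__":
    redis_cfg = {"maxmemory-policy": "allkeys-lru", "appendonly": "no", "cluster-enabled": "yes"}
    valkey_cfg = {"maxmemory-policy": "allkeys-lru", "appendonly": "yes", "cluster-enabled": "yes"}
    print(config_drift(redis_cfg, valkey_cfg))
```

An empty result would be the precondition for treating canary latency numbers as comparable to the Redis baseline.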

Candidate estimate (inferred, not source-confirmed): Configure Envoy proxy with redis_proxy filter for dual-write, routing 10% of write traffic to Valkey canary while Redis continues serving 100% of reads

infra · immediate
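
The routing rule in the action above (mirror 10% of writes, keep all reads on Redis) can be illustrated at application level. This is a sketch of the idea only, not Envoy's mirroring implementation: `ShadowWriter` and `in_shadow_sample` are hypothetical names, the two backends are plain dictionaries standing in for Redis and the Valkey canary, and sampling by key hash (rather than per request) is an assumption made so the same keys always reach the canary.

```python
import zlib


def in_shadow_sample(key: str, percent: int = 10) -> bool:
    """Deterministically place a key in the shadow sample by hashing it."""
    return zlib.crc32(key.encode("utf-8")) % 100 < percent


class ShadowWriter:
    """Writes go to the primary; a sampled subset is mirrored to the shadow."""

    def __init__(self, primary: dict, shadow: dict, percent: int = 10):
        self.primary = primary
        self.shadow = shadow
        self.percent = percent
        self.shadow_errors = 0

    def set(self, key: str, value) -> None:
        self.primary[key] = value  # the primary write decides success
        if in_shadow_sample(key, self.percent):
            try:
                self.shadow[key] = value
            except Exception:
                # A failing canary must never fail the production write.
                self.shadow_errors += 1

    def get(self, key: str):
        return self.primary.get(key)  # reads never touch the shadow


if __name__ == "__main__":
    primary, shadow = {}, {}
    writer = ShadowWriter(primary, shadow)
    for i in range(10_000):
        writer.set(f"session:{i}", i)
    print(len(primary), len(shadow))
```

Counting `shadow_errors` separately keeps canary failures visible without coupling them to the production write path, which is the property the dual-write phase depends on.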

Candidate estimate (inferred, not source-confirmed): Build Prometheus/Grafana dashboards tracking the four abort thresholds: p99 >3ms, gossip >100 Mbps, pub/sub >5ms, >2 node failures/7 days, plus cache hit ratio ≥85%

infra · immediate

Candidate estimate (inferred, not source-confirmed): Run canary for 2 weeks under shadow write load, comparing Valkey p99/p999 latency distributions against Redis baseline at equivalent traffic volume

backend · immediate
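
Comparing tail latency between the two systems needs the same percentile definition on both sides. A minimal sketch, assuming latency samples in milliseconds are already available as lists and using the nearest-rank percentile definition (an assumption; the verdict does not say how p99 or p999 should be computed). The names `percentile` and `compare_tail_latency` are illustrative.

```python
import math
from fractions import Fraction


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence of samples."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    # Fraction(str(pct)) keeps values such as 99.9 exact, so the rank is
    # not skewed by binary floating-point rounding.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compare_tail_latency(redis_ms, valkey_ms, pcts=(99.0, 99.9)):
    """Return per-percentile values for both systems and the Valkey delta."""
    report = {}
    for pct in pcts:
        redis_value = percentile(redis_ms, pct)
        valkey_value = percentile(valkey_ms, pct)
        report[pct] = {
            "redis": redis_value,
            "valkey": valkey_value,
            "delta": valkey_value - redis_value,
        }
    return report


if __name__ == "__main__":
    baseline = [0.8, 0.9, 1.0, 1.1, 1.9]   # hypothetical samples, ms
    canary = [0.8, 0.9, 1.1, 1.2, 2.4]     # hypothetical samples, ms
    print(compare_tail_latency(baseline, canary))
```

A two-week run will produce far more samples than fit in a list, so in practice the same comparison would likely be made on histogram data; the sketch only fixes the definition being compared.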

Candidate estimate (inferred, not source-confirmed): Benchmark Valkey pub/sub message throughput on the 20-node canary to validate the 100K messages/sec threshold before Phase 2 read shifting

backend · before_launch

Candidate estimate (inferred, not source-confirmed): At end of Phase 1 (Month 2), evaluate canary metrics against abort thresholds and decide whether to proceed to Phase 2 read shifting or abort migration

infra · before_launch
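
The Phase 1 decision gate above can be written down as a small check over the four abort thresholds named in the verdict. A minimal sketch: the `CanaryMetrics` fields, `ABORT_LIMITS`, and `phase_gate` are hypothetical names, the limits are the verdict's inferred candidate values rather than measured ones, and "exceeds" is read as strictly greater than.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    p99_ms: float             # Valkey p99 latency
    gossip_mbps: float        # cluster gossip bandwidth
    pubsub_latency_ms: float  # pub/sub delivery latency
    node_failures_7d: int     # node failures in the trailing 7 days


# Candidate abort limits from the verdict (inferred, not source-confirmed).
ABORT_LIMITS = {
    "p99_ms": 3.0,
    "gossip_mbps": 100.0,
    "pubsub_latency_ms": 5.0,
    "node_failures_7d": 2,
}


def breached_thresholds(metrics: CanaryMetrics) -> list:
    """Names of the abort thresholds the metrics strictly exceed."""
    return [
        name
        for name, limit in ABORT_LIMITS.items()
        if getattr(metrics, name) > limit
    ]


def phase_gate(metrics: CanaryMetrics) -> str:
    """'abort' if any threshold is breached, otherwise 'proceed'."""
    return "abort" if breached_thresholds(metrics) else "proceed"


if __name__ == "__main__":
    observed = CanaryMetrics(p99_ms=3.5, gossip_mbps=40.0,
                             pubsub_latency_ms=1.2, node_failures_7d=3)
    print(phase_gate(observed), breached_thresholds(observed))
```

The verdict also qualifies the p99 condition as "sustained" in its reversal conditions; a real gate would evaluate a window of observations rather than a single snapshot.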

Verdict-to-Work

A model gives you advice. VectorCourt turns the verdict into accountable work.

### Valkey canary fails abort thresholds during Phase 1 (p99 >3ms sustained, gossip >100 Mbps, or >2 node failures) and root cause is a funda...

- Finding ID: `reversal_condition:1_valkey_canary_fails_abort_thresholds_during_phase_1__p99__3ms_sustained__gossip__100_mbps__or__2_node_failu`
- Subtype: `reversal_condition`
- Evidence status: `observed`
- Default work type: `investigation_wo`
- Summary: Negotiate Redis Enterprise commercial license despite the $400K-$600K/year cost, or evaluate DragonflyDB if BSL 1.1 is legally acceptable for the organization
- Evidence boundary: condition flips verdict when observed
- Reversal condition: Valkey canary fails abort thresholds during Phase 1 (p99 >3ms sustained, gossip >100 Mbps, or >2 node failures) and root cause is a fundamental Valkey architectural limitation rather than configuration

Acceptance criteria:
- Root cause or measurement plan is identified for the reversal condition.
- Evidence status remains marked synthetic until measured.
- Follow-up implementation work is created only after evidence is observed.

### Redis Ltd reverses the SSPL/RSAL license change or creates a permissive-use exemption for self-hosted non-competing deployments

- Finding ID: `reversal_condition:2_redis_ltd_reverses_the_sspl_rsal_license_change_or_creates_a_permissive-use_exemption_for_self-hosted_non-c`
- Subtype: `reversal_condition`
- Evidence status: `observed`
- Default work type: `investigation_wo`
- Summary: Stay on Redis, upgrade to latest version, cancel migration
- Evidence boundary: condition flips verdict when observed
- Reversal condition: Redis Ltd reverses the SSPL/RSAL license change or creates a permissive-use exemption for self-hosted non-competing deployments

Acceptance criteria:
- Root cause or measurement plan is identified for the reversal condition.
- Evidence status remains marked synthetic until measured.
- Follow-up implementation work is created only after evidence is observed.

### Pub/sub workload exceeds 100K messages/sec and cannot be isolated onto a dedicated cluster due to application coupling, causing cluster b...

- Finding ID: `reversal_condition:3_pub_sub_workload_exceeds_100k_messages_sec_and_cannot_be_isolated_onto_a_dedicated_cluster_due_to_applicati`
- Subtype: `reversal_condition`
- Evidence status: `observed`
- Default work type: `investigation_wo`
- Summary: Migrate session/rate-limiting to Valkey but move pub/sub workload to a dedicated message broker (Kafka, NATS) rather than running it on Valkey cluster
- Evidence boundary: condition flips verdict when observed
- Reversal condition: Pub/sub workload exceeds 100K messages/sec and cannot be isolated onto a dedicated cluster due to application coupling, causing cluster bandwidth saturation at scale

Acceptance criteria:
- Root cause or measurement plan is identified for the reversal condition.
- Evidence status remains marked synthetic until measured.
- Follow-up implementation work is created only after evidence is observed.

### Deploy 20-node Valkey 7.2.6 canary cluster with identical configuration to current Redis nodes (maxmemory policy, persistence settings, c...

- Finding ID: `repair_action:1_deploy_20-node_valkey_7_2_6_canary_cluster_with_identical_configuration_to_current_redis_nodes__maxmemory_policy`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: implement
- Evidence boundary: infra

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

### Configure Envoy proxy with redis_proxy filter for dual-write, routing 10% of write traffic to Valkey canary while Redis continues serving...

- Finding ID: `repair_action:2_configure_envoy_proxy_with_redis_proxy_filter_for_dual-write__routing_10__of_write_traffic_to_valkey_canary_whil`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: implement
- Evidence boundary: infra

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

### Build Prometheus/Grafana dashboards tracking the four abort thresholds: p99 >3ms, gossip >100 Mbps, pub/sub >5ms, >2 node failures/7 days...

- Finding ID: `repair_action:3_build_prometheus_grafana_dashboards_tracking_the_four_abort_thresholds:_p99__3ms__gossip__100_mbps__pub_sub__5ms`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: monitor
- Evidence boundary: infra

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

### Run canary for 2 weeks under shadow write load, comparing Valkey p99/p999 latency distributions against Redis baseline at equivalent traf...

- Finding ID: `repair_action:4_run_canary_for_2_weeks_under_shadow_write_load__comparing_valkey_p99_p999_latency_distributions_against_redis_ba`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: validate
- Evidence boundary: backend

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

### Benchmark Valkey pub/sub message throughput on the 20-node canary to validate the 100K messages/sec threshold before Phase 2 read shifting

- Finding ID: `repair_action:5_benchmark_valkey_pub_sub_message_throughput_on_the_20-node_canary_to_validate_the_100k_messages_sec_threshold_be`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: investigate
- Evidence boundary: backend

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

### At end of Phase 1 (Month 2), evaluate canary metrics against abort thresholds and decide whether to proceed to Phase 2 read shifting or a...

- Finding ID: `repair_action:6_at_end_of_phase_1__month_2__evaluate_canary_metrics_against_abort_thresholds_and_decide_whether_to_proceed_to_ph`
- Subtype: `repair_action`
- Evidence status: `observed`
- Default work type: `repair_wo`
- Summary: decide
- Evidence boundary: infra

Acceptance criteria:
- The repair is implemented with deterministic verification.
- The source verdict is linked for revalidation.

This verdict stops being true when

Candidate estimate (inferred, not source-confirmed): Valkey canary fails abort thresholds during Phase 1 (p99 >3ms sustained, gossip >100 Mbps, or >2 node failures) and root cause is a fundamental Valkey architectural limitation rather than configuration → Candidate estimate (inferred, not source-confirmed): Negotiate Redis Enterprise commercial license despite the $400K-$600K/year cost, or evaluate DragonflyDB if BSL 1.1 is legally acceptable for the organization

Redis Ltd reverses the SSPL/RSAL license change or creates a permissive-use exemption for self-hosted non-competing deployments → Stay on Redis, upgrade to latest version, cancel migration

Candidate estimate (inferred, not source-confirmed): Pub/sub workload exceeds 100K messages/sec and cannot be isolated onto a dedicated cluster due to application coupling, causing cluster bandwidth saturation at scale → Migrate session/rate-limiting to Valkey but move pub/sub workload to a dedicated message broker (Kafka, NATS) rather than running it on Valkey cluster

Council notes

Socrates

RECOMMENDATION: Treat the Redis license change as a legal/contractual issue, not a technical one. Before committing t...

Vulcan

Explore the technical and operational feasibility of migrating the 200-node Redis deployment to Valkey, focusing on m...

Daedalus

RECOMMENDATION: Execute a phased canary migration from Redis to Valkey 7.2.x using a dual-write proxy pattern over 4 ...

Loki

A dual-write proxy pattern on 200 nodes at 2M ops/sec introduces inevitable consistency risks from out-of-order deliv...

Evidence boundary

Observed from your filing

should we replace Redis with Valkey now that Redis changed its license, or stay on Redis for a 200-node deployment handling 2M ops/sec?

Assumptions used for analysis

Valkey 7.2.x is API-compatible with the Redis commands and data structures currently used across the 200-node deployment, with no custom Redis modules or RESP3-specific features that Valkey does not support
The existing deployment runs Redis 7.2 or earlier (the last BSD-3-Clause version) and has not yet upgraded to Redis 7.4+ under the new SSPL/RSAL license
Cloud infrastructure can provision 20 additional nodes for canary without exceeding quota or budget approval timelines
The 2M ops/sec workload is distributed across session cache, rate limiting, and pub/sub — not a single monolithic use case that cannot be decomposed for phased migration
Envoy with redis_proxy filter can handle the dual-write throughput at the required proxy layer without becoming a bottleneck itself
team size synthetic default (not observed): standard team (5-10 engineers) [synthetic] (not_addressed)
existing stack synthetic default (not observed): greenfield assumed [synthetic] (not_addressed)
connection pooler not provided: not specified [not_provided] (not_addressed)
current state not provided: not specified [not_provided] (not_addressed)
rollback plan not provided: not specified [not_provided] (not_addressed)
data volume not provided: not specified [not_provided] (not_addressed)

Inferred candidate specifics

These details were introduced by the Council during analysis. They were not supplied in your filing.

Execute a phased canary migration from Redis to Valkey 7.2.x over 4 months using a dual-write proxy pattern (Envoy with redis_proxy filter or Twemproxy). Phase 1: Stand up a 20-node Valkey canary (10% of fleet) receiving shadow writes while Redis serves all reads. Phase 2: Shift reads for the session cache workload to the Valkey canary, validating p99 ≤2ms and cache hit ratio ≥85%. Phase 3: Expand to 100 Valkey nodes at 50% traffic. Phase 4: Full 200-node cutover with Redis kept warm for a 2-week rollback window. Abort if: Valkey p99 exceeds 3ms, cluster gossip exceeds 100 Mbps, pub/sub latency exceeds 5ms, or more than 2 nodes fail in any 7-day canary window. Key failure mode: pub/sub at 200 nodes broadcasts to all cluster members; if real-time events exceed 100K messages/sec, internal bandwidth saturates. Mitigation: isolate pub/sub onto a dedicated 16-node cluster. Second failure mode: cluster rebalancing storms across the 16,384 hash slots during node topology changes. Budget: $50K total. This avoids the $400K-$600K/year Redis Enterprise licensing cost and the security risk of staying on Redis 7.2 (the last BSD-3-Clause version) as patches shift to 7.4+.
Deploy a 20-node Valkey 7.2.6 canary cluster in the same availability zone as the existing Redis deployment, configure Envoy with redis_proxy filter for dual-write from 10% of the production write path, and instrument Prometheus/Grafana dashboards tracking p99 latency, gossip bandwidth, pub/sub delivery latency, and node failure rate against the four abort thresholds.
b003 had the highest confidence (0.90) among surviving branches, survived 3 rounds of adversarial challenge including a direct attack on dual-write feasibility (b004, killed), and provided the most concrete architecture: named proxy technology (Envoy redis_proxy), specific phase timeline, quantified abort thresholds, named failure modes with mitigations, and a budget breakdown. b002 (0.70) was a strictly weaker version of the same recommendation without the specificity.
Rejected alternative: Hybrid architecture with Valkey at edge and commercial caching (ElastiCache) for critical workloads
Reason: Architecturally incoherent, since ElastiCache is itself Redis/Valkey under the hood. Introduced cache coherence problems at 2M ops/sec without naming a consistency protocol. Claimed a p99 of 1.5ms while adding a synchronization layer, which contradicts basic latency math. Fabricated budget constraints.
Rejected alternative: Treat as a legal/contractual issue, negotiate commercial Redis license before any migration
Reason: SSPL/RSAL is a blanket license change, not negotiable per-customer. Redis Enterprise for 200 nodes would cost $400K-$600K/year vs. $50K one-time migration. Backup options (KeyDB unmaintained since 2022, DragonflyDB uses BSL 1.1) have the same or worse license problems. Delay accumulates unpatched CVE exposure on Redis 7.2.
Rejected alternative: Reject dual-write as introducing insurmountable consistency risks and >10ms p99 spikes
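
The cost argument above compares a one-time $50K migration with $400K-$600K/year in licensing. As a rough worked example using only those two synthetic figures (the function names are illustrative, and ongoing operating costs of a self-run Valkey fleet are deliberately left out because the verdict gives no number for them):

```python
def payback_months(one_time_cost: float, annual_cost_avoided: float) -> float:
    """Months until avoided annual cost equals the one-time cost."""
    return one_time_cost * 12 / annual_cost_avoided


def first_year_net_saving(one_time_cost: float, annual_cost_avoided: float) -> float:
    """Avoided cost in year one minus the one-time cost."""
    return annual_cost_avoided - one_time_cost


if __name__ == "__main__":
    migration = 50_000                      # verdict's budget estimate
    for licence in (400_000, 600_000):      # verdict's licensing range
        print(licence,
              payback_months(migration, licence),
              first_year_net_saving(migration, licence))
```

On these inputs the payback period is 1.0 to 1.5 months and the first-year net saving is $350K to $550K, which only holds if both estimates are roughly right.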

Unknowns blocking a firmer verdict

Valkey 7.2.x cluster behavior at exactly 200 nodes is not widely benchmarked in public literature — the gossip bandwidth and rebalancing storm thresholds are engineering estimates, not production-validated numbers at this specific scale
b003's budget of $50K is a rough estimate — actual costs depend heavily on cloud provider, instance types, and whether reserved/spot pricing is available for the canary phase
The pub/sub 100K messages/sec threshold for bandwidth saturation is model-derived, not benchmarked against Valkey's specific cluster broadcast implementation
Redis 7.2 security patch timeline is uncertain — Redis Ltd may continue critical CVE patches longer than expected, or may not
b004 (killed) raised a valid concern about dual-write consistency during network partitions that b003 addresses only via abort thresholds, not via a formal consistency protocol
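
The pub/sub threshold listed above is described as model-derived. One way to see where such a number could come from is a back-of-the-envelope fan-out estimate: with classic (non-sharded) cluster pub/sub, each published message is forwarded to every other node. The sketch below assumes exactly that model, ignores cluster-bus header overhead, and takes the payload size as a free parameter; the 200-byte payload in the usage lines is purely hypothetical, since the filing gives no message size.

```python
def pubsub_fanout_mbps(messages_per_sec: int, payload_bytes: int, nodes: int) -> float:
    """Aggregate inter-node bandwidth (Mbps) if every message is forwarded
    to each of the other nodes. Lower bound: headers are not counted."""
    if nodes < 1:
        raise ValueError("cluster needs at least one node")
    bits_per_sec = messages_per_sec * payload_bytes * 8 * (nodes - 1)
    return bits_per_sec / 1_000_000


if __name__ == "__main__":
    rate = 100_000   # messages/sec threshold named in the verdict
    payload = 200    # bytes per message: hypothetical, not from the filing
    print(pubsub_fanout_mbps(rate, payload, 200))  # full 200-node cluster
    print(pubsub_fanout_mbps(rate, payload, 16))   # dedicated 16-node cluster
```

Under these assumptions the aggregate is 31,840 Mbps at 200 nodes against 2,400 Mbps at 16 nodes, which is the scaling effect the isolation mitigation relies on. The absolute values move linearly with the real payload size, so they should be replaced with benchmarked figures.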

Operational signals to watch

reversal — Candidate estimate (inferred, not source-confirmed): Valkey canary fails abort thresholds during Phase 1 (p99 >3ms sustained, gossip >100 Mbps, or >2 node failures) and root cause is a fundamental Valkey architectural limitation rather than configuration

reversal — Redis Ltd reverses the SSPL/RSAL license change or creates a permissive-use exemption for self-hosted non-competing deployments

reversal — Candidate estimate (inferred, not source-confirmed): Pub/sub workload exceeds 100K messages/sec and cannot be isolated onto a dedicated cluster due to application coupling, causing cluster bandwidth saturation at scale

Branch battle map

Battle timeline (3 rounds)

Round 1 — Initial positions · 2 branches

Branch b001 (Socrates) eliminated — This branch has fundamental structural problems that make...

Round 2 — Adversarial probes · 3 branches

Loki proposed branch b004

Branch b004 (Loki) eliminated — Branch b004 posits that dual-write introduces insurmounta...

Socrates proposed branch b005

Loki A dual-write proxy pattern on 200 nodes at 2M ops/sec introduces inevitable cons…

Socrates RECOMMENDATION: Treat the Redis license change as a legal/contractual issue, not…

Round 3 — Final convergence · 2 branches

Branch b005 (Socrates) eliminated — This branch has a fatal structural flaw: it treats the Re...

Evidence source proof

Evidence source proof is not available for legacy verdicts created before 2026-05-20.

Council chamber

Socrates · Analyst

Vulcan · Engineer

Daedalus · Architect

Loki · Disruptor

5851eba9-8d79-4bab-9a09-6e2e22ae5b37 · Protocol

Council archetypes represent independent reasoning perspectives. They are not individuals but structured reasoning roles.

VectorCourt processes filings through approved AI providers; per-verdict model routing is disclosed in Enterprise audit exports.

This verdict is a structured reasoning artifact, not professional advice. VectorCourt does not provide legal, financial, medical, or other professional advice. You are responsible for your own decisions.
