What breaks if we switch from REST to gRPC for all internal services?

provisional risk_surface · Pro · 384s · $0.58
Unverified thresholds detected · Overconfidence flagged · Council oversight flagged for review · High number of unresolved uncertainties
6 branches explored · 2 survived · 4 rounds · integrity 100%
45% confidence
Risk: low

Decision: 50%
Execution: high
Uncertainty: high
Reasoning: 57%
Evidence: 46%
Stability: 48%

Decision

Adopt a tiered migration strategy rather than switching all services to gRPC simultaneously. Classify internal services into three tiers: performance-critical (Class A → gRPC within 6 months), moderate-performance (Class B → hybrid REST/gRPC), and integration-heavy (Class C → remain REST for 12+ months).

Rationale: a blanket REST-to-gRPC migration breaks browser client compatibility, eliminates HTTP caching infrastructure, disrupts debugging workflows (curl, Postman, browser DevTools), adds protobuf schema management overhead, and forces team reskilling, all at once across every service. Tiered classification isolates these breakage points to manageable batches while capturing performance gains where they matter most (services requiring <50ms response time).

Key failure modes:

  • Inconsistent service boundaries causing increased cognitive load for developers maintaining both communication patterns
  • Premature optimization of low-traffic services consuming resources that could be allocated to actual performance bottlenecks
  • Misclassification of services leading to the wrong protocol choice, for example a Class C service that actually has latency-sensitive internal callers

Thresholds: response time < 50ms for Class A services; Class A migration within 6 months; Class C remains on REST for 12+ months.
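
The Class A/B/C split above can be sketched as a small rule function. This is only an illustration under stated assumptions: the `ServiceProfile` fields, the `classify` name, and the exact rules are hypothetical, and the 50 ms default mirrors the report's own threshold, which the report itself marks as unverified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProfile:
    """Hypothetical inventory record for one internal service."""
    name: str
    required_p99_ms: float       # response time that internal callers need
    external_integrations: int   # browser, partner, or other REST-only consumers
    uses_http_caching: bool      # depends on HTTP caches or CDN behaviour


def classify(svc: ServiceProfile, latency_threshold_ms: float = 50.0) -> str:
    """Return 'A', 'B', or 'C' under the assumed rules below."""
    latency_sensitive = svc.required_p99_ms < latency_threshold_ms
    rest_bound = svc.external_integrations > 0 or svc.uses_http_caching

    if latency_sensitive and not rest_bound:
        return "A"  # internal-only and latency-critical: migrate to gRPC first
    if rest_bound and not latency_sensitive:
        return "C"  # integration-heavy: stay on REST
    return "B"      # everything else: hybrid REST/gRPC
```

Under these rules a service that is both latency-sensitive and REST-bound lands in Class B instead of Class C, which is one way to guard against the misclassification failure mode listed above.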

Next actions

Build a service inventory with measured RPS, p99 latency, consumer count, and REST-specific dependencies (caching, debugging tools, load balancer configs) for every internal service
backend · immediate
Define quantitative classification criteria for Class A/B/C based on the inventory data — specific RPS thresholds, latency requirements, and external integration counts
backend · immediate
Run a proof-of-concept gRPC migration on one Class A service with Envoy transcoding sidecar, measuring actual p99 latency improvement and developer onboarding time
backend · before_launch
Set up a shared protobuf registry (Buf Schema Registry) with CI-enforced breaking-change detection before any service begins migration
infra · before_launch
Track developer cognitive load metrics (context-switch frequency, incident rate per protocol type, onboarding time for new team members) throughout migration to detect if dual-paradigm maintenance is degrading velocity
backend · ongoing
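
To make the "CI-enforced breaking-change detection" action more concrete, here is a toy check over two versions of a message schema. It is not how Buf implements its checks; the `Schema` representation and the `breaking_changes` function are hypothetical, and treating every field removal as breaking is a deliberately conservative assumption. The one fact it relies on is that the protobuf binary encoding identifies fields by number, not by name.

```python
from typing import Dict, List, Tuple

# Hypothetical schema representation: {field_name: (field_number, type_name)}
Schema = Dict[str, Tuple[int, str]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List wire-level problems when moving from `old` to `new`."""
    problems: List[str] = []
    # Index the new schema by field number, since numbers identify fields on the wire.
    new_by_number = {num: (name, typ) for name, (num, typ) in new.items()}

    for name, (num, typ) in old.items():
        if num not in new_by_number:
            problems.append(f"field {num} ({name}) removed")
            continue
        _, new_typ = new_by_number[num]
        if new_typ != typ:
            problems.append(f"field {num} ({name}) changed type {typ} -> {new_typ}")
    return problems
```

A CI job built on this idea would fail the build whenever the returned list is non-empty. Renames pass this particular check because the field number is unchanged, although they can still break generated code and JSON mappings.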
This verdict stops being true when
Benchmarking reveals REST+HTTP/3 with compression closes the latency gap to within 10% of gRPC for the organization's actual payload sizes and call patterns → Stay on REST, invest in HTTP/3 migration and JSON Schema enforcement instead of gRPC migration
Service inventory reveals fewer than 3 services meeting Class A criteria (truly latency-sensitive internal-only services) → Do not migrate — the operational cost of introducing gRPC exceeds the performance benefit for a small number of qualifying services
The team is building greenfield services or has already committed to a full rewrite → Adopt gRPC uniformly for all internal services from the start, avoiding dual-paradigm complexity entirely
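
The three reversal conditions above are concrete enough to encode as a check that can be re-run as benchmark and inventory data arrive. A minimal sketch, assuming the "within 10%" gap is measured relative to gRPC's p99 latency; the function name, parameters, and returned strings are hypothetical.

```python
from typing import List


def triggered_reversals(
    grpc_p99_ms: float,
    rest_http3_p99_ms: float,
    class_a_services: int,
    greenfield_or_rewrite: bool,
) -> List[str]:
    """Return the alternative actions whose trigger condition is met."""
    if grpc_p99_ms <= 0:
        raise ValueError("grpc_p99_ms must be positive")

    triggered: List[str] = []
    # Relative latency gap of REST+HTTP/3 versus gRPC (assumed definition).
    gap = (rest_http3_p99_ms - grpc_p99_ms) / grpc_p99_ms
    if gap <= 0.10:
        triggered.append("stay on REST; invest in HTTP/3 and JSON Schema enforcement")
    if class_a_services < 3:
        triggered.append("do not migrate; too few Class A services")
    if greenfield_or_rewrite:
        triggered.append("adopt gRPC uniformly from the start")
    return triggered
```

An empty result means none of the listed conditions has fired and the tiered verdict still stands on these criteria.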

Council notes

Vulcan
Adopt a hybrid strategy: Retain REST for low-priority and less performance-critical services but incrementally migrat...
Socrates
Reframe the problem: Instead of asking what communication protocol to use, ask whether internal services should expos...
Daedalus
Reject both alternatives as stated. Adopt a phased migration to gRPC using Envoy sidecar proxies with gRPC-JSON trans...
Loki
What if the opposite were true? What *improves* if we optimize REST with HTTP/3, compression, and JSON Schema instead...

Attack grid

Survival rate shows how the recommendation holds under stress scenarios. Low scores indicate conditional vulnerability, not a flaw in the recommendation.

8/8 scenarios survived

architecture: 2/2 (100%)
operations: 2/2 (100%)
security: 2/2 (100%)
environmental: 2/2 (100%)
Scenario detail (8)
✓ latency_impact: cross-region latency added
✓ failure_cascade: single component failure triggers 3 downstream failures
✓ operational_complexity: key person leaves
✓ human_factors: cognitive overload from alert fatigue
✓ compliance_requirements: SOC2 certification required
✓ security_surface: zero-day in primary dependency
✓ geopolitical_risk: sanctions imposed on primary cloud provider's operating country
✓ cost_trajectory: usage 5× projection

Assumptions

  • The organization runs a microservices architecture with multiple internal services communicating synchronously over REST today
  • There exist measurable performance differences between REST/JSON and gRPC/protobuf for the organization's actual payload sizes and call patterns
  • The engineering team has capacity to maintain two communication paradigms simultaneously during a multi-month transition
  • Service classification into A/B/C tiers can be done objectively based on measurable metrics rather than political negotiation
  • The organization's infrastructure (load balancers, service mesh, API gateways, observability stack) can support gRPC — specifically HTTP/2 end-to-end

Fragility signals

  • Hubris: ANNOTATE

Operational signals to watch

reversal — Benchmarking reveals REST+HTTP/3 with compression closes the latency gap to within 10% of gRPC for the organization's actual payload sizes and call patterns
reversal — Service inventory reveals fewer than 3 services meeting Class A criteria (truly latency-sensitive internal-only services)
reversal — The team is building greenfield services or has already committed to a full rewrite

Flip conditions

Robustness: 78%

medium — Remove the latency assumption of ~1-2ms per Envoy sidecar transcoding hop. → b003
low — Reduce service count to fewer than 10 or call chain depth to fewer than 3. → b005
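
The two flip conditions interact: a per-hop transcoding cost only matters once it is multiplied along a call chain. A back-of-the-envelope sketch follows, assuming the worst case in which every call in the chain passes through one transcoding sidecar hop. The function names are hypothetical, and the 50 ms budget is the report's unverified Class A threshold.

```python
def transcoding_overhead_ms(
    chain_depth: int, per_hop_ms: float, hops_per_call: int = 1
) -> float:
    """Total added latency if every call in the chain is transcoded."""
    return chain_depth * hops_per_call * per_hop_ms


def overhead_share(
    chain_depth: int, per_hop_ms: float, latency_budget_ms: float = 50.0
) -> float:
    """Fraction of the latency budget consumed by transcoding alone."""
    return transcoding_overhead_ms(chain_depth, per_hop_ms) / latency_budget_ms
```

With the report's upper estimate of 2 ms per hop and a chain five calls deep, the overhead under this assumption is 10 ms, or a fifth of a 50 ms budget. At a depth below 3 it stays under 6 ms, which is consistent with the report listing reduced chain depth as a condition that changes the outcome.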

Unresolved uncertainty

  • The winning branch (b002) was critiqued for not directly inventorying what breaks — it focuses on migration strategy rather than a comprehensive breakage catalog. The killed b003 branch had significantly more specific technical failure modes (proto schema corruption, transcoding latency accumulation) that the winner lacks.
  • No branch provided a complete 'what breaks' inventory covering all dimensions: load balancer reconfiguration, observability pipeline changes, testing tool replacement, CI/CD pipeline modifications, service mesh compatibility, and team skill gaps.
  • The <50ms threshold for Class A services and the 6/12-month timelines are synthetic — no branch grounded these numbers in measured system data or named engineering heuristics.
  • Verdict is largely model-reasoning only — the 3 evidence items (quality mean=1.00) all mapped to b003 which was killed. The surviving winner has no external evidence support.
  • REST+HTTP/3 optimization (b006's point) was not seriously evaluated against gRPC for internal services — this remains a legitimate unexplored alternative that could change the recommendation if benchmarked.

Branch battle map

Battle timeline (4 rounds)
Round 1 — Initial positions · 4 branches
Socrates proposed branch b004
Loki proposed branch b005
Socrates Reframe the problem: Instead of asking what communication protocol to use, ask w…
Loki Swap the key constraint of distributed microservices over the network: Assume al…
Round 2 — Adversarial probes · 3 branches
Branch b004 (Socrates) eliminated — Branch b004 commits a classic architectural deflection: i...
Branch b005 (Loki) eliminated — The suggestion to unify all internal services within a si...
Round 3 — Structural challenge · 2 branches
Branch b001 (Vulcan) eliminated — The hybrid strategy (b001) assumes we can simply categori...
Round 4 — Final convergence · 2 branches
Branch b003 (Daedalus) eliminated — The oversight critique correctly identifies that b003, de...
Loki proposed branch b006
Loki What if the opposite were true? What *improves* if we optimize REST with HTTP/3,…

Minority report

What if the opposite were true? What *improves* if we optimize REST with HTTP/3, compression, and JSON Schema instead of chasing gRPC? Both branches fixate on gRPC's hype while ignoring REST's maturity in caching, idempotency, and ecosystem tooling.

Loki · dissent strength 40%

Pre-mortem (3 scenarios)
  • The shared proto registry (Buf Schema Registry) goes down unexpectedly for 4 hours during the migration. → Critical services experience partial or complete outages due to schema mismatches, requiring a rollback of the migration and delaying the project by several weeks.
  • A key team misuses gRPC deadline settings, setting excessively small timeout values in their clients. → Critical-path services suffer major outages under production load, eroding stakeholder confidence and requiring significant unplanned engineering time to fix misconfigured clients.
  • The decommissioning of REST endpoints proceeds after 14 days of zero REST traffic, but a rarely used partner integration pipeline relying on REST is overlooked during usage audits. → Significant partner dissatisfaction and reputational damage occur, with additional engineering resources required to create a temporary bridge or re-enable REST endpoints.
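
One common guard against the deadline-misuse scenario is to derive each downstream timeout from the caller's remaining budget instead of hard-coding small values per client. The sketch below uses only the standard library; the function names and the `reserve_s` and `floor_s` defaults are hypothetical values chosen for illustration, not recommendations.

```python
import time
from typing import Optional


def remaining_budget_s(deadline_monotonic: float, now: Optional[float] = None) -> float:
    """Seconds left before the caller's deadline, never negative."""
    current = time.monotonic() if now is None else now
    return max(0.0, deadline_monotonic - current)


def child_timeout_s(
    deadline_monotonic: float,
    reserve_s: float = 0.005,   # assumed time kept back for local work
    floor_s: float = 0.010,     # assumed minimum worth sending downstream
    now: Optional[float] = None,
) -> Optional[float]:
    """Timeout for a downstream call, or None if the caller should fail fast."""
    budget = remaining_budget_s(deadline_monotonic, now) - reserve_s
    if budget < floor_s:
        return None
    return budget
```

In a Python gRPC client the returned value could be passed as the `timeout=` argument of a stub call, and a `None` result would be handled by returning a deadline-exceeded error to the caller without issuing the call.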

Censor oversight

REOPEN SPAR

The winning decision (b003) provides a detailed migration plan but fails to directly address the original question 'what breaks'. It also assumes certain expertise and doesn't scope infrastructure coupling. Surviving branch b002 offers a more nuanced approach that was not selected despite higher confidence in some model outputs.

Structural issues

  • SELECTION MISMATCH: b002 provides a reasonable classification framework and polyglot persistence approach, which is more nuanced than b003's blanket gRPC migration
  • CONSULTING FOG: The winning decision describes a migration plan but doesn't directly address 'what breaks' when switching to gRPC