GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing
Alessio Ricci Toniolo, Abinaya Dinesh, Rome Thorstenson

TL;DR
The paper introduces GORGO, a cross-region load balancing method for large language model inference that optimizes for minimal total serving time by considering compute, cache, and network latency, outperforming existing baselines.
Contribution
GORGO is a novel load balancing approach that jointly optimizes cache reuse and network latency, with extensive profiling and benchmarking demonstrating significant TTFT improvements.
Findings
GORGO reduces P99 TTFT by optimizing network-aware routing.
It prevents pathological cross-region forwarding, improving average TTFT.
GORGO-proxy is 2.5x faster on median TTFT, overcoming synchronization overhead.
Abstract
Distributing LLM inference across geographical regions can improve Time-to-First-Token (TTFT) by regionalizing service deployments. While existing multi-region load balancers save prefill computation by prioritizing Key-Value (KV) cache hit rate, they ignore cluster networking latency, a critical factor in routing decisions. We introduce GORGO, a method for minimizing TTFT by optimizing a total serving cost as a function of available compute, network latency, and prefix caching. Using extensive profiling on custom infrastructure, we analyze component-level latency bottlenecks and benchmark GORGO against three baselines: (1) naive least-load routing, which ignores prefix-cache overlap; (2) prefix-similarity routing, which selectively pushes requests to the replica with the highest cached-prefix overlap; and (3) a centralized HTTP proxy that runs the GORGO policy while tracking requests…
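To make the contrast between the routing policies in the abstract concrete, here is a minimal Python sketch. It is not the paper's cost model: the additive estimate (network round trip + queue wait + prefill time for uncached tokens), the `Replica` fields, the constant `PREFILL_MS_PER_TOKEN`, and all numbers are assumptions chosen only for illustration.

```python
from dataclasses import dataclass

# Assumed constant: prefill cost per uncached prompt token (hypothetical value).
PREFILL_MS_PER_TOKEN = 0.5


@dataclass(frozen=True)
class Replica:
    """Hypothetical view of one serving replica as seen by the router."""
    name: str
    rtt_ms: float          # network round trip from the router to the replica
    queue_ms: float        # estimated wait behind already-queued work
    cached_prefix: tuple   # token ids of a prefix held in the replica's KV cache


def overlap(prompt, cached):
    """Length of the common leading run of tokens between prompt and cache."""
    n = 0
    for a, b in zip(prompt, cached):
        if a != b:
            break
        n += 1
    return n


def estimated_ttft_ms(prompt, replica):
    """Assumed additive cost: network + queueing + prefill of uncached tokens."""
    uncached = len(prompt) - overlap(prompt, replica.cached_prefix)
    return replica.rtt_ms + replica.queue_ms + uncached * PREFILL_MS_PER_TOKEN


def route_least_load(prompt, replicas):
    """Baseline (1): pick the least-loaded replica, ignoring cache overlap."""
    return min(replicas, key=lambda r: r.queue_ms)


def route_prefix_similarity(prompt, replicas):
    """Baseline (2): pick the replica with the longest cached-prefix overlap."""
    return max(replicas, key=lambda r: overlap(prompt, r.cached_prefix))


def route_total_cost(prompt, replicas):
    """Cost-based policy: pick the replica with the lowest estimated TTFT."""
    return min(replicas, key=lambda r: estimated_ttft_ms(prompt, r))


prompt = tuple(range(200))
replicas = [
    Replica("local", rtt_ms=2.0, queue_ms=40.0, cached_prefix=prompt[:50]),
    Replica("remote", rtt_ms=150.0, queue_ms=5.0, cached_prefix=prompt[:180]),
    Replica("idle", rtt_ms=60.0, queue_ms=0.0, cached_prefix=()),
]

for policy in (route_least_load, route_prefix_similarity, route_total_cost):
    chosen = policy(prompt, replicas)
    print(policy.__name__, chosen.name, estimated_ttft_ms(prompt, chosen))
```

With these made-up numbers the three policies choose three different replicas. Prefix-similarity routing forwards the request across regions for the largest cache hit even though the round trip outweighs the prefill savings, while the cost-based policy keeps it on the nearby replica.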
Taxonomy
Topics: Caching and Content Delivery · Cloud Computing and Resource Management · Software System Performance and Reliability
