GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing

Alessio Ricci Toniolo; Abinaya Dinesh; Rome Thorstenson

arXiv:2602.11688·cs.NI·February 13, 2026

Open Access

TL;DR

The paper introduces GORGO, a cross-region load-balancing method for large language model inference. It minimizes total serving time by jointly accounting for available compute, prefix-cache reuse, and network latency, and it outperforms the baselines it is benchmarked against.

Contribution

GORGO is a load-balancing approach that jointly optimizes KV-cache reuse and network latency. Component-level profiling on custom infrastructure and benchmarks against three baselines show TTFT improvements.

Findings

1. GORGO reduces P99 TTFT through network-aware routing.

2. It improves average TTFT by preventing pathological cross-region forwarding.

3. GORGO-proxy is 2.5x faster on median TTFT and overcomes synchronization overhead.

Abstract

Distributing LLM inference across geographical regions can improve Time-to-First-Token (TTFT) by regionalizing service deployments. While existing multi-region load balancers save prefill computation by prioritizing Key-Value (KV) cache hit rate, they ignore cluster networking latency, a critical factor in routing decisions. We introduce GORGO, a method for minimizing TTFT by optimizing a total serving cost as a function of available compute, network latency, and prefix caching. Using extensive profiling on custom infrastructure, we analyze component-level latency bottlenecks and benchmark GORGO against three baselines: (1) naive least-load routing, which ignores prefix-cache overlap; (2) prefix-similarity routing, which selectively pushes requests to the replica with the highest cached-prefix overlap; and (3) a centralized HTTP proxy that runs the GORGO policy while tracking requests…
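
To make the contrast between the routing policies concrete, here is a minimal sketch in Python. It is not the paper's formulation: the abstract only says the cost is a function of available compute, network latency, and prefix caching, so the additive cost below, the `Replica` fields, the `prefill_ms_per_token` constant, and all function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    rtt_ms: float          # assumed: network round trip from router to replica
    queued_tokens: int     # assumed proxy for load: prefill tokens already waiting
    cached_prefix: tuple   # assumed: token ids of the longest prefix in its KV cache


def shared_prefix_len(a, b):
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimated_ttft_ms(prompt, replica, prefill_ms_per_token=0.5):
    """Hypothetical additive cost: network + queueing + uncached prefill."""
    hit = shared_prefix_len(prompt, replica.cached_prefix)
    uncached = len(prompt) - hit
    return replica.rtt_ms + prefill_ms_per_token * (replica.queued_tokens + uncached)


def route_least_load(prompt, replicas):
    # Baseline 1: ignores prefix-cache overlap and network distance.
    return min(replicas, key=lambda r: r.queued_tokens)


def route_prefix_similarity(prompt, replicas):
    # Baseline 2: picks the replica with the highest cached-prefix overlap.
    return max(replicas, key=lambda r: shared_prefix_len(prompt, r.cached_prefix))


def route_min_cost(prompt, replicas):
    # Cost-based routing in the spirit of the abstract (illustrative only).
    return min(replicas, key=lambda r: estimated_ttft_ms(prompt, r))


prompt = tuple(range(100))
replicas = [
    Replica("local", rtt_ms=2.0, queued_tokens=40, cached_prefix=()),
    Replica("remote_warm", rtt_ms=150.0, queued_tokens=10, cached_prefix=tuple(range(80))),
    Replica("far_idle", rtt_ms=300.0, queued_tokens=0, cached_prefix=()),
]

for policy in (route_least_load, route_prefix_similarity, route_min_cost):
    print(policy.__name__, "->", policy(prompt, replicas).name)
```

With these made-up numbers the three policies disagree: least-load picks the idle but distant replica, prefix-similarity forwards across regions to the warm cache, and the cost-based rule stays local because the network round trip outweighs the prefill saved by the cache hit. Under the assumed cost model, this is the kind of pathological cross-region forwarding that network-aware routing is meant to avoid.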

Taxonomy

Topics: Caching and Content Delivery · Cloud Computing and Resource Management · Software System Performance and Reliability