SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference
Tian Xia, Ziming Mao, Jamison Kerney, Ethan J. Jackson, Zhifei Li, Jiarong Xing, Scott Shenker, Ion Stoica

TL;DR
SkyWalker is a multi-region load balancer for LLM inference that raises throughput, lowers latency, and cuts serving cost by aggregating regional diurnal traffic patterns across regions while preserving KV-Cache locality.
Contribution
It introduces a cache-aware, cross-region traffic handling mechanism that enables cost-effective and efficient multi-region LLM serving with preserved KV-Cache locality.
Findings
Achieves 1.12-2.06x higher throughput
Reduces latency by 1.74-6.30x
Cuts total serving cost by 25%
Abstract
Serving Large Language Models (LLMs) efficiently in multi-region setups remains a challenge. Due to cost and GPU availability concerns, providers typically deploy LLMs in multiple regions using instances with long-term commitments, such as reserved instances or on-premises clusters, which are often underutilized because traffic is handled region-locally and varies diurnally. In this paper, we introduce SkyWalker, a multi-region load balancer for LLM inference that aggregates regional diurnal patterns through cross-region traffic handling. By doing so, SkyWalker enables providers to reserve instances based on expected global demand, rather than peak demand in each individual region. Meanwhile, SkyWalker preserves KV-Cache locality and load balancing, ensuring cost efficiency without sacrificing performance. SkyWalker achieves this with a cache-aware cross-region traffic handler and a…
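The reservation argument in the abstract (provisioning for global peak demand instead of the sum of per-region peaks) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the three regions, their peak hours, the half-cosine demand shape, and the `diurnal_demand` helper are hypothetical and are not taken from the paper, so the resulting numbers are not the paper's results.

```python
import math


def diurnal_demand(peak_hour, base=2.0, amplitude=8.0):
    """Hypothetical hourly demand (in instances) that peaks at `peak_hour` UTC.

    Demand is `base` overnight and rises along a half-cosine during the day.
    """
    return [
        base + amplitude * max(0.0, math.cos(2 * math.pi * (h - peak_hour) / 24))
        for h in range(24)
    ]


# Assumed regions whose daily peaks are offset by eight hours.
regions = {
    "us": diurnal_demand(peak_hour=20),
    "eu": diurnal_demand(peak_hour=12),
    "asia": diurnal_demand(peak_hour=4),
}

# Region-local handling: each region must reserve for its own peak.
per_region_reservation = sum(max(demand) for demand in regions.values())

# Cross-region handling: reserve for the peak of the summed demand.
global_demand = [sum(demand[h] for demand in regions.values()) for h in range(24)]
global_reservation = max(global_demand)

print(f"sum of regional peaks: {per_region_reservation:.1f} instances")
print(f"peak of global demand: {global_reservation:.1f} instances")
```

Because the assumed peaks do not coincide, the peak of the aggregate is lower than the sum of the regional peaks. The size of the gap depends entirely on the assumed demand curves, and the sketch ignores cross-region network latency and KV-Cache locality, which the abstract names as constraints the balancer has to respect.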
