OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

Xinyue Ma; Heelim Hong; Taegeon Um; Jongseop Lee; Seoyeong Choy; Woo-Yeon Lee; Myeongjae Jeon

arXiv:2601.10729·cs.AI·March 3, 2026

Open Access

TL;DR

OrbitFlow is an adaptive system that manages KV cache placement for long-context LLM serving, significantly improving latency and throughput by dynamically optimizing cache retention based on runtime feedback.

Contribution

The paper introduces a fine-grained, runtime-adaptive KV cache management system that uses ILP optimization together with fallback mechanisms to meet latency SLOs in long-context LLM serving.

Findings

01. Improves SLO attainment for TPOT and TBT by up to 66% and 48%, respectively.

02. Reduces 95th percentile latency by 38%.

03. Achieves up to 3.3x higher throughput.

Abstract

Serving long-context LLMs is challenging because request lengths and batch composition vary during token generation, causing the memory footprint to fluctuate significantly at runtime. Offloading KV caches to host memory limits effective memory usage, but existing static and predetermined offloading strategies cannot adapt to the rapidly shifting memory demands of long-context serving. This often leads to excessive CPU-to-GPU KV transfers that translate into latency spikes and frequent SLO violations. To address these challenges, we introduce OrbitFlow, a fine-grained and adaptive KV cache management system that meets latency SLOs in long-context LLM serving. OrbitFlow employs a lightweight ILP solver to decide which layers' KV caches to retain on the GPU for each request, within memory capacity constraints. It continuously refines KV placements based on runtime feedback when the active…
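
The per-request layer placement described in the abstract can be illustrated with a toy sketch. This is not OrbitFlow's solver: it assumes a simplified objective (minimize the KV bytes left in host memory, as a rough proxy for CPU-to-GPU transfer cost) and replaces the ILP solver with exhaustive search, which is only practical for a handful of requests. The function name `place_kv_layers` and its parameters are hypothetical.

```python
from itertools import product


def place_kv_layers(per_layer_bytes, num_layers, gpu_budget_bytes):
    """Choose, per request, how many layers' KV caches stay on the GPU.

    per_layer_bytes[i] is the KV size of ONE layer for request i
    (assumption: every layer of a request has the same KV size).
    Returns (retained_counts, offloaded_bytes) for a placement that
    fits in gpu_budget_bytes and leaves the fewest bytes in host memory.
    """
    best_counts, best_offloaded = None, None
    # Enumerate every combination of "layers retained" across requests.
    for counts in product(range(num_layers + 1), repeat=len(per_layer_bytes)):
        on_gpu = sum(k * b for k, b in zip(counts, per_layer_bytes))
        if on_gpu > gpu_budget_bytes:
            continue  # violates the GPU memory capacity constraint
        offloaded = sum(
            (num_layers - k) * b for k, b in zip(counts, per_layer_bytes)
        )
        if best_offloaded is None or offloaded < best_offloaded:
            best_counts, best_offloaded = counts, offloaded
    return best_counts, best_offloaded


if __name__ == "__main__":
    # Three requests with different context lengths, 4 layers each.
    counts, offloaded = place_kv_layers([4, 2, 1], num_layers=4, gpu_budget_bytes=16)
    print("layers kept on GPU per request:", counts)
    print("bytes left in host memory:", offloaded)
```

Rerunning such a placement whenever the batch changes mirrors, in a very loose way, the runtime refinement the abstract mentions. A real system would need a latency model and a solver that scales beyond brute force.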

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cloud Computing and Resource Management