Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective
Fangzhou Wu, Sandeep Silwal, Qiuyi (Richard) Zhang

TL;DR
This paper introduces a unified model for KV cache eviction and query routing in LLM inference, proposing algorithms that improve cache hit rates and load balancing, validated by extensive experiments showing significant performance gains.
Contribution
It provides the first unified mathematical framework for KV cache eviction and query routing, integrating randomized eviction with learning-based routing for better LLM inference efficiency.
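To make the coupling between routing and caching concrete, here is a minimal toy sketch. It is not the paper's model or its learning-based router. Everything in it is an assumption made for illustration: integer prefix IDs, `prefix % num_workers` as a stand-in for prefix-affinity hashing, round-robin as the load-balancing baseline, and an unbounded per-worker cache so that only the routing effect is visible.

```python
from collections import Counter


def route_by_prefix(queries, num_workers):
    """Send every query with the same prefix ID to the same worker (cache-friendly)."""
    return [prefix % num_workers for prefix in queries]


def route_round_robin(queries, num_workers):
    """Spread queries evenly across workers, ignoring prefixes (load-friendly)."""
    return [i % num_workers for i in range(len(queries))]


def max_load(assignment):
    """Number of queries received by the busiest worker."""
    return max(Counter(assignment).values())


def prefix_hits(queries, assignment):
    """Count queries whose prefix was already seen on the assigned worker.

    Assumes an unbounded per-worker cache, so eviction plays no role here.
    """
    seen = set()
    hits = 0
    for prefix, worker in zip(queries, assignment):
        if (worker, prefix) in seen:
            hits += 1
        seen.add((worker, prefix))
    return hits


# Hypothetical skewed workload: prefix 0 is hot, prefixes 1..3 are cold.
queries = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 1]
num_workers = 4

affinity = route_by_prefix(queries, num_workers)
balanced = route_round_robin(queries, num_workers)

print("affinity:    max load", max_load(affinity), "hits", prefix_hits(queries, affinity))
print("round-robin: max load", max_load(balanced), "hits", prefix_hits(queries, balanced))
```

On this workload, prefix affinity gets 8 hits but puts 8 of the 12 queries on one worker, while round-robin caps the load at 3 per worker and gets only 4 hits. Neither extreme is satisfactory, which is the kind of trade-off a unified treatment of eviction and routing has to address.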
Findings
Up to 6.92× increase in cache hit rate
Up to 11.96× reduction in latency
Up to 77.4% increase in throughput
Abstract
KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi-LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning-based methods to adaptively route queries with evolving patterns, thus balancing query…
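A classic way to see why a deterministic policy such as LRU can struggle, and why randomization can help, is a cyclic access pattern over one more item than the cache can hold. The sketch below is an illustration under stated assumptions, not the paper's eviction algorithm: cache entries are opaque integer keys (real KV caches share prefixes between queries), the randomized policy evicts uniformly at random, and the function names, `capacity`, and the workload are hypothetical.

```python
import random
from collections import OrderedDict


def lru_hits(requests, capacity):
    """Count cache hits under Least Recently Used eviction."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return hits


def random_eviction_hits(requests, capacity, seed=0):
    """Count cache hits when the victim is chosen uniformly at random."""
    rng = random.Random(seed)
    slots = []       # cached keys, indexable for random victim selection
    members = set()  # same keys, for O(1) membership tests
    hits = 0
    for key in requests:
        if key in members:
            hits += 1
            continue
        if len(slots) >= capacity:
            victim = rng.randrange(len(slots))
            members.discard(slots[victim])
            slots[victim] = key  # overwrite the evicted slot
        else:
            slots.append(key)
        members.add(key)
    return hits


# Adversarial-for-LRU workload: cycle through capacity + 1 distinct keys.
capacity = 4
requests = [i % (capacity + 1) for i in range(1000)]

lru = lru_hits(requests, capacity)
rnd = random_eviction_hits(requests, capacity)
print("LRU hits:", lru)
print("random-eviction hits:", rnd)
```

LRU always evicts exactly the key that is requested next in this cycle, so it never hits. Random eviction only occasionally removes the next-needed key, so it retains a substantial hit rate on the same sequence.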
Peer Reviews
Decision·ICLR 2026 Poster
1. LLM KV cache management and query routing is a critical problem, and the paper does a good job of choosing an important problem to solve.
2. The paper does a good job of piecing together the theoretical underpinnings of KV cache management, which makes the motivation for RLT and LBGR easy to follow.
3. Section 3.1 does a great job of formalizing the notation and laying the groundwork for later sections. The lemmas are intuitive.
4. The experiments are quite extensive, covering
1. The improvement claims made in the intro should be qualified by model type and size, context length, available HBM, etc. Otherwise it is hard to trust these numbers. Please take the time to segment the results into small vs. large models, dense vs. MoE, relationship with context length, etc.
2. Some recent literature is missing, for example [1].
3. The figures could be improved. For instance, Figure 5 violates the margin.
4. The idea of the MIP is not used much throughout the paper, so
- Clear problem framing that couples cache eviction with routing
- Strong empirical results across four benchmarks with higher hit rate and throughput
- Some assumptions are under-discussed (see questions)
1. This paper is clearly written, and the main problem is well-motivated from a practical LLM-serving perspective.
2. This paper provides a combination of a theoretical foundation and practical implementation.
3. I appreciate that the authors go beyond heuristic system designs and provide a theoretically grounded formulation together with competitive analysis for the cache eviction process.
1. As I understand, RLT may be affected by the random seed. It would be better to include an ablation study evaluating the stability of RLT under different random seed settings.
2. From Figure 6, it appears that the advantage of your method diminishes as the number of workers increases. Could you explain why the proposed approach cannot (or does not need to) scale to a larger number of workers?
3. Writing: It would be better to include a notation table in Section 3 to improve readability and hel
Taxonomy
Topics: Caching and Content Delivery · Cloud Computing and Resource Management · Big Data and Digital Economy
