DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving
Ying Yuan, Pengfei Zuo, Bo Wang, Zhangyu Chen, Zhipeng Tan, Zhou Yu

TL;DR
DualMap is a scheduling strategy for distributed LLM serving that hashes each request to two candidate instances and selects between them, reconciling cache affinity with load balancing and raising effective request capacity by up to 2.25×.
Contribution
It introduces a dual-mapping scheduling approach that unifies cache affinity and load balancing in distributed LLM serving, with techniques for dynamic workload adaptation.
Findings
Up to 2.25× increase in effective request capacity.
Improved cache reuse and load balancing under real-world workloads.
Enhanced robustness with SLO-aware routing and hotspot mitigation.
Abstract
In LLM serving, reusing the KV cache of prompts across requests is critical for reducing TTFT and serving costs. Cache-affinity scheduling, which co-locates requests with the same prompt prefix to maximize KV cache reuse, often conflicts with load-balancing scheduling that distributes requests evenly across compute instances. Existing schedulers fail to reconcile this trade-off as they operate within a single mapping space, typically applying cache-affinity routing to a subset of requests and load-balanced routing to the rest, without a unified solution to achieve both goals. To address this limitation, we propose DualMap, a dual-mapping scheduling strategy for distributed LLM serving that achieves both cache affinity and load balancing. Its key idea is to map each request to two candidate instances via two independent hash functions based on the request prompt, then intelligently…
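The dual-mapping idea in the abstract can be illustrated with a minimal sketch. The abstract is cut off before it describes how DualMap chooses between the two candidates, so the selection rule below (pick the less loaded candidate, in the style of "power of two choices") is an assumption, not the paper's method. The function names, the salted SHA-256 hashing, the character-based `prefix_len`, and the request-count load metric are all hypothetical choices made for illustration.

```python
import hashlib
from typing import List, Tuple


def _hash_to_instance(prefix: str, salt: str, num_instances: int) -> int:
    """Map a prompt prefix to an instance index with a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + prefix).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


def candidates(prompt: str, num_instances: int, prefix_len: int = 64) -> Tuple[int, int]:
    """Return two candidate instances derived from the prompt prefix.

    Requests that share the same prefix always get the same pair, which is
    what preserves cache affinity. (Assumption: the prefix is the first
    `prefix_len` characters; a real system would likely hash token blocks.)
    """
    prefix = prompt[:prefix_len]
    first = _hash_to_instance(prefix, "h1:", num_instances)
    second = _hash_to_instance(prefix, "h2:", num_instances)
    # If both hashes collide, fall back to a neighbour so there is a real choice.
    if second == first and num_instances > 1:
        second = (first + 1) % num_instances
    return first, second


def route(prompt: str, load: List[int], prefix_len: int = 64) -> int:
    """Send the request to the less loaded of its two candidates.

    `load[i]` is an assumed per-instance load metric (here: request count).
    """
    a, b = candidates(prompt, len(load), prefix_len)
    chosen = a if load[a] <= load[b] else b
    load[chosen] += 1
    return chosen


if __name__ == "__main__":
    load = [0] * 8
    system_prompt = "You are a helpful assistant. " * 5
    for i in range(4):
        print(route(system_prompt + f"question {i}", load), load)
```

Because both candidates depend only on the prompt prefix, every request sharing that prefix lands on one of the same two instances, while the load comparison gives the scheduler room to avoid the busier one.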
Peer Reviews
Decision·ICLR 2026 Poster
I found this paper to be a good extension of existing work, with decent and extensive results. The authors contribute an interesting hotspot-aware rebalancing technique and lightweight rebalancing.
There seems to be a lack of scalability and scheduler overhead analysis in the implementation. It would be interesting to see results on more GPUs (even if simulated). The workload discussion covers cache migration based on TTFT, but it would also be interesting to add direct NVLink transfer or memory cache state awareness to this policy.
- This paper addresses a real trade-off in LLM serving systems: load balancing vs. cache affinity.
- Comprehensive evaluation shows the advantage of the proposed method against multiple baselines.
- There are a few points that are unclear to me. See questions.
1. The problem is well defined: the trade-off between cache affinity and load balancing in LLM serving.
2. Extends the “power of two choices” concept to LLM scheduling, offering a novel way to achieve both objectives simultaneously.
3. The evaluation is comprehensive. Benchmarks across models and different baselines clearly show the superior performance.
1. The motivation for using two hashes for scheduling is unclear. Is it to save scheduling latency? Or, why not collect global information from all workers and then choose the best one (e.g. based on a weighted sum of prefix-cache and balance benefits)? The paper is very unclear on this point.
2. While “power of two choices” is cited, formal analysis of DualMap’s convergence or optimality is limited.
3. Lacks scheduling overhead analysis.
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
