DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

Ying Yuan; Pengfei Zuo; Bo Wang; Zhangyu Chen; Zhipeng Tan; Zhou Yu

arXiv:2602.06502 · cs.DC · February 9, 2026

TL;DR

DualMap is a scheduling strategy for distributed LLM serving that reconciles cache affinity with load balancing: it maps each request to two candidate instances via two independent hash functions on the prompt and selects between them, raising effective request capacity by up to 2.25×.

Contribution

It introduces a dual-mapping scheduling approach that unifies cache affinity and load balancing in distributed LLM serving, with techniques for dynamic workload adaptation.

Findings

01. Up to 2.25× increase in effective request capacity.

02. Improved cache reuse and load balancing under real-world workloads.

03. Enhanced robustness with SLO-aware routing and hotspot mitigation.

Abstract

In LLM serving, reusing the KV cache of prompts across requests is critical for reducing TTFT and serving costs. Cache-affinity scheduling, which co-locates requests with the same prompt prefix to maximize KV cache reuse, often conflicts with load-balancing scheduling that distributes requests evenly across compute instances. Existing schedulers fail to reconcile this trade-off as they operate within a single mapping space, typically applying cache-affinity routing to a subset of requests and load-balanced routing to the rest, without a unified solution to achieve both goals. To address this limitation, we propose DualMap, a dual-mapping scheduling strategy for distributed LLM serving that achieves both cache affinity and load balancing. Its key idea is to map each request to two candidate instances via two independent hash functions based on the request prompt, then intelligently…
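
The dual-mapping idea in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it describes how DualMap chooses between the two candidates, so the least-loaded rule below is only an assumption borrowed from the classic "power of two choices" scheme, not the paper's actual selection logic. The function names, the salted SHA-256 hashes, and the use of a simple per-instance request count as "load" are likewise hypothetical.

```python
import hashlib


def candidate_instances(prompt_prefix: str, num_instances: int) -> tuple[int, int]:
    """Map a prompt prefix to two candidate instances via two salted hashes.

    The salts stand in for "two independent hash functions"; they are an
    illustrative choice, not taken from the paper.
    """
    def salted_hash(salt: str) -> int:
        digest = hashlib.sha256((salt + prompt_prefix).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_instances

    return salted_hash("map-a:"), salted_hash("map-b:")


def route(prompt_prefix: str, loads: list[int]) -> int:
    """Send the request to the less-loaded of its two candidates.

    ASSUMPTION: least-loaded selection (power of two choices). The source
    text does not state DualMap's real selection rule.
    """
    first, second = candidate_instances(prompt_prefix, len(loads))
    chosen = first if loads[first] <= loads[second] else second
    loads[chosen] += 1  # hypothetical load metric: outstanding request count
    return chosen


if __name__ == "__main__":
    loads = [0] * 8
    for _ in range(5):
        print(route("You are a helpful assistant.", loads))
```

Because both candidates are derived from the prompt alone, requests sharing a prefix can only land on the same two instances, which is what preserves cache affinity, while the choice between the two leaves room to balance load.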

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

I found this paper to be a good extension of existing work, and it provides decent, extensive results. The authors contribute an interesting hotspot-aware rebalancing technique and lightweight rebalancing.

Weaknesses

There seems to be a lack of scalability and scheduler-overhead analysis in the implementation. It would be interesting to see results on more GPUs (even if simulated). The paper discusses cache migration based on TTFT, but it would also be interesting if direct NVLink transfer or memory cache state awareness were added to this policy.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- This paper addresses a real trade-off in LLM serving systems: load balancing vs. cache affinity.
- The comprehensive evaluation shows the advantage of the proposed method against multiple baselines.

Weaknesses

- There are a few points that are unclear to me. See questions.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The problem is well defined: the trade-off between cache affinity and load balancing in LLM serving.
2. Extends the “power of two choices” concept to LLM scheduling, offering a novel way to achieve both objectives simultaneously.
3. The evaluation is comprehensive. Benchmarks across models and different baselines clearly show the superior performance.

Weaknesses

1. The motivation for using two hashes for scheduling is unclear. Is it to reduce scheduling latency? If not, why not collect global information from all workers and then choose the best one (e.g., based on a weighted sum of prefix-cache and balance benefits)? The paper is very unclear on this point.
2. While “power of two choices” is cited, formal analysis of DualMap’s convergence or optimality is limited.
3. Lacks scheduling overhead analysis.

Taxonomy

Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems