NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding
Jiefei Chen, Binbin Lin, Jinming Ma, Jiangfei Duan, Haojie Duanmu, Hao Liu, Qinxiu Cheng, Xiuhong Li, Zhilin Pei, Hui Wang, Xingcheng Zhang, and Dahua Lin

TL;DR
NanoCP introduces request-level dynamic context parallelism for serving MoE models: each request is assigned a context-parallelism degree based on its KV footprint, which balances KV cache occupancy and batch sizes across instances, improving throughput and reducing tail latency.
Contribution
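It proposes a dynamic context parallelism approach that decouples MoE communication from KV cache placement, so that batch size (which drives MoE communication latency) and KV cache size (which drives attention latency) can be balanced separately rather than being tied to a single instance.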
Findings
Achieves up to 3.27x higher request rates under strict SLOs.
Reduces P99 tail latency by up to 2.12x.
Balances KV cache occupancy and batch sizes effectively.
Abstract
Modern serving systems for Mixture-of-Experts (MoE) models adopt hybrid data-expert parallelism: expert parallelism (EP) shards experts across GPUs to scale capacity, while data parallelism (DP) replicates attention layers across instances to process independent requests. Existing systems bind each request's attention, MoE communication, and KV cache to a single instance. Because attention latency scales with KV cache size while MoE communication latency scales with batch size, this binding cannot balance both simultaneously, producing EP stragglers; it also fragments KV memory across instances, inflating tail latency under long contexts. While existing context parallelism (CP) mitigates these constraints, its uniform parallelism degree incurs prohibitive communication and attention-side overheads. We present NanoCP, which decouples MoE communication from KV cache placement and…
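The imbalance described in the abstract can be illustrated with a toy placement model. The sketch below is not NanoCP's algorithm: the threshold rule, the power-of-two degrees, the least-loaded shard placement, and all function names are assumptions made only for illustration. It compares binding each request to one instance against splitting a long request's KV cache across several instances while keeping its batch slot on a single "home" instance.

```python
from typing import List, Tuple


def bound_placement(kv_lens: List[int], num_instances: int) -> Tuple[List[int], List[int]]:
    """Bind each request (batch slot and KV cache) to one instance, round-robin."""
    batch = [0] * num_instances
    kv = [0] * num_instances
    for i, n in enumerate(kv_lens):
        inst = i % num_instances
        batch[inst] += 1
        kv[inst] += n
    return batch, kv


def pick_cp_degree(kv_len: int, threshold: int, max_degree: int) -> int:
    """Hypothetical rule: double the degree until each shard fits under `threshold`."""
    degree = 1
    while degree < max_degree and kv_len / degree > threshold:
        degree *= 2
    return degree


def request_level_cp(kv_lens: List[int], num_instances: int,
                     threshold: int) -> Tuple[List[int], List[int]]:
    """Keep the batch slot on a home instance, but shard KV by a per-request degree."""
    batch = [0] * num_instances
    kv = [0] * num_instances
    for i, n in enumerate(kv_lens):
        batch[i % num_instances] += 1  # batch slot stays on one home instance
        degree = pick_cp_degree(n, threshold, num_instances)
        # Assumed heuristic: place shards on the instances holding the least KV.
        targets = sorted(range(num_instances), key=lambda j: kv[j])[:degree]
        share, rem = divmod(n, degree)
        for k, j in enumerate(targets):
            kv[j] += share + (1 if k < rem else 0)
    return batch, kv


if __name__ == "__main__":
    # One long-context request among seven short ones, on four instances.
    kv_lens = [64000] + [1000] * 7
    print("bound:", bound_placement(kv_lens, 4))
    print("request-level CP:", request_level_cp(kv_lens, 4, threshold=16000))
```

In this toy setting both schemes give every instance the same batch size, but the bound placement leaves one instance holding almost all of the KV cache, whereas the per-request degree spreads it nearly evenly. Short requests keep degree 1, so they pay no extra communication, which is the intuition behind avoiding a uniform parallelism degree.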
