DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72

Wanqian Li; Jintao Peng; Zongfei Jing; Tianyu Zhang; Ze Long; Xianjie Qiao; Xiaoming Chen; Dongxu Yang; Kefeng Duan; June Yang

arXiv:2604.01621 · cs.DC · May 13, 2026

TL;DR

DWDP is an inference parallelization strategy for large language models that removes collective inter-rank synchronization, so each GPU in a multi-GPU deployment can progress independently and throughput per GPU improves.

Contribution

The paper proposes DWDP, a parallelization method that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand, which removes collective inter-rank synchronization from LLM inference.

Findings

01

DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range on GB200 NVL72.

02

Without collective inter-rank synchronization, each GPU progresses independently, which makes end-to-end performance less sensitive to workload imbalance.

03

DWDP is implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72.
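
To see why layer-wise synchronization makes performance sensitive to imbalance, consider a toy timing model. This is an illustrative sketch only, not the paper's cost model: the function names and the per-layer times are hypothetical, and the model ignores the remote-weight fetch overhead that DWDP has to hide.

```python
def synchronized_time(per_layer_times):
    """Layer-wise barrier: every rank waits for the slowest rank at each layer."""
    return sum(max(layer) for layer in per_layer_times)


def independent_times(per_layer_times):
    """No barrier: each rank finishes after the sum of its own layer times."""
    num_ranks = len(per_layer_times[0])
    return [sum(layer[r] for layer in per_layer_times) for r in range(num_ranks)]


# Hypothetical per-layer compute times in ms: times[layer][rank].
# The two ranks are imbalanced in opposite directions on layers 0 and 1.
times = [[1.0, 3.0], [3.0, 1.0], [2.0, 2.0]]

sync_total = synchronized_time(times)    # 3.0 + 3.0 + 2.0 = 8.0
indep_totals = independent_times(times)  # [6.0, 6.0]
```

In this made-up case both ranks do the same total work, yet the synchronized schedule takes 8.0 ms instead of 6.0 ms, because each barrier charges every rank for the slowest peer.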

Abstract

Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under…
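
The core idea of fetching missing experts on demand can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the TensorRT-LLM implementation: the `Rank` class, the round-robin expert placement in `build_ranks`, and the keep-after-fetch behavior are all hypothetical, and real weights would move over GPU interconnect rather than through Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Rank:
    rank_id: int
    local_experts: dict                          # expert_id -> weights owned by this rank
    fetched: dict = field(default_factory=dict)  # remote experts already copied in
    remote_fetches: int = 0

    def get_expert(self, expert_id, peers):
        """Return expert weights, reading from a peer only if not resident."""
        if expert_id in self.local_experts:
            return self.local_experts[expert_id]
        if expert_id not in self.fetched:
            # One-sided read from the owning peer; no collective is involved.
            owner = next(p for p in peers if expert_id in p.local_experts)
            self.fetched[expert_id] = owner.local_experts[expert_id]
            self.remote_fetches += 1
        return self.fetched[expert_id]


def build_ranks(num_ranks, num_experts):
    """Assumed layout: experts are split round-robin across ranks."""
    ranks = [Rank(r, {}) for r in range(num_ranks)]
    for e in range(num_experts):
        ranks[e % num_ranks].local_experts[e] = f"weights[{e}]"
    return ranks


ranks = build_ranks(num_ranks=4, num_experts=8)
r0 = ranks[0]
local = r0.get_expert(0, ranks)    # expert 0 is resident on rank 0
remote = r0.get_expert(5, ranks)   # expert 5 lives on rank 1 and is fetched
again = r0.get_expert(5, ranks)    # second access reuses the fetched copy
```

Each rank only consults its own tables and reads from a single owner, so no rank has to wait at a barrier for the others. The asynchronous prefetch optimization mentioned in the abstract would issue such reads ahead of use; that part is not modeled here.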
