DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
Wanqian Li, Jintao Peng, Zongfei Jing, Tianyu Zhang, Ze Long, Xianjie Qiao, Xiaoming Chen, Dongxu Yang, Kefeng Duan, June Yang

TL;DR
DWDP is an inference parallelization strategy for large language models that removes collective inter-rank synchronization, so each GPU in a multi-GPU deployment can progress independently, improving throughput.
Contribution
It proposes DWDP, a parallelization method that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand, which removes collective inter-rank synchronization from LLM inference.
Findings
DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range on GB200 NVL72.
It enables GPUs to progress independently, reducing workload imbalance effects.
It is implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72.
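To illustrate why removing layer-wise synchronization reduces sensitivity to workload imbalance, here is a toy timing model. It is only a sketch: the per-layer times are made-up numbers, the function names are hypothetical, and the model ignores the cost of fetching remote weights, which the paper addresses separately.

```python
# Toy model: times[rank][layer] is the compute time of one layer on one rank.
# All numbers are illustrative assumptions, not measurements from the paper.

def synchronized_makespan(times):
    """Layer-wise barrier: every rank waits for the slowest rank at each layer."""
    num_layers = len(times[0])
    return sum(max(rank[layer] for rank in times) for layer in range(num_layers))

def independent_makespan(times):
    """No barrier: each rank runs its layers back to back; the slowest rank finishes last."""
    return max(sum(rank) for rank in times)

times = [
    [1.0, 3.0, 1.0],  # rank 0 is slow on layer 1
    [3.0, 1.0, 1.0],  # rank 1 is slow on layer 0
]

sync_time = synchronized_makespan(times)   # 3 + 3 + 1 = 7.0
indep_time = independent_makespan(times)   # max(5, 5) = 5.0
```

In this model the synchronized time is a sum of per-layer maxima and the independent time is a maximum of per-rank sums, so the independent time can never be larger. The gap grows when different ranks are slow on different layers.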
Abstract
Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under…
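The idea of fetching missing experts on demand with asynchronous prefetch can be sketched in plain Python. This is a simulation under stated assumptions, not the TensorRT-LLM implementation: the round-robin expert placement, the names `owner`, `gather_layer`, and `run_rank`, and the use of a thread to stand in for an asynchronous peer-GPU copy are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_RANKS, NUM_LAYERS, NUM_EXPERTS = 2, 3, 4

def owner(expert):
    # Assumed placement: experts are spread round-robin across ranks.
    return expert % NUM_RANKS

def build_shards():
    # shards[rank][(layer, expert)] holds a scalar standing in for expert weights.
    shards = {r: {} for r in range(NUM_RANKS)}
    for layer in range(NUM_LAYERS):
        for e in range(NUM_EXPERTS):
            shards[owner(e)][(layer, e)] = float(10 * layer + e)
    return shards

def gather_layer(shards, rank, layer):
    """Assemble all experts of one layer, reading missing ones from peer shards."""
    local = shards[rank]
    weights, remote_reads = {}, 0
    for e in range(NUM_EXPERTS):
        key = (layer, e)
        if key in local:
            weights[e] = local[key]
        else:
            weights[e] = shards[owner(e)][key]  # stands in for a peer-GPU copy
            remote_reads += 1
    return weights, remote_reads

def run_rank(shards, rank, x):
    """Run all layers on one rank, prefetching layer i+1 while layer i computes."""
    remote_total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(gather_layer, shards, rank, 0)
        for layer in range(NUM_LAYERS):
            weights, remote = pending.result()  # wait only for this rank's own fetch
            remote_total += remote
            if layer + 1 < NUM_LAYERS:
                pending = pool.submit(gather_layer, shards, rank, layer + 1)
            x = x + sum(weights.values())  # placeholder for the real MoE computation
    return x, remote_total

result, remote_reads = run_rank(build_shards(), rank=0, x=1.0)
```

Each rank only ever waits on its own prefetch, never on a collective with other ranks, which is the property the abstract describes. Summing all expert values is a placeholder; a real MoE layer would route each token to a subset of experts.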
