Communication-Efficient Serving for Video Diffusion Models with Latent Parallelism

Zhiyuan Wu; Shuai Wang; Li Chen; Kaihui Gao; Dan Li; Yanyu Ren; Qiming Zhang; Yong Wang

arXiv:2512.07350 · cs.DC · December 9, 2025


TL;DR

This paper introduces Latent Parallelism (LP), a parallelism strategy for serving video diffusion models across multiple GPUs. LP reduces communication overhead by up to 97% relative to baseline methods while maintaining generation quality.

Contribution

The paper proposes Latent Parallelism, a parallelism method tailored for video diffusion model (VDM) serving that exploits local spatio-temporal dependencies in the denoising process to reduce communication costs.

Findings

1. LP reduces communication overhead by up to 97% compared to baseline methods.
2. LP maintains generation quality comparable to existing parallelism strategies.
3. LP can be combined with existing parallelism techniques.

Abstract

Video diffusion models (VDMs) perform attention computation over the 3D spatio-temporal domain. Compared to large language models (LLMs) processing 1D sequences, their memory consumption scales cubically, necessitating parallel serving across multiple GPUs. Traditional parallelism strategies partition the computational graph, requiring frequent high-dimensional activation transfers that create severe communication bottlenecks. To tackle this issue, we exploit the local spatio-temporal dependencies inherent in the diffusion denoising process and propose Latent Parallelism (LP), the first parallelism strategy tailored for VDM serving. LP decomposes the global denoising problem into parallelizable sub-problems by dynamically rotating the partitioning dimensions (temporal, height, and width) within the compact latent space across diffusion timesteps, substantially reducing…
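The abstract describes partitioning the latent along a dimension that rotates (temporal, height, width) from one diffusion timestep to the next. The snippet below is a minimal single-process sketch of that idea only, under stated assumptions: a (C, T, H, W) latent layout, a round-robin T, H, W schedule, and equal contiguous splits. The names `partition_axis`, `shard_latent`, `fake_denoise`, and `run` are hypothetical; the paper's actual schedule, boundary handling, and communication pattern are not given in this excerpt.

```python
import numpy as np

# Assumed layout (hypothetical): latent has shape (C, T, H, W).
AXIS_NAMES = ("temporal", "height", "width")


def partition_axis(step: int) -> int:
    """Assumed round-robin schedule: T at step 0, H at step 1, W at step 2, then repeat."""
    return step % 3


def shard_latent(latent: np.ndarray, num_gpus: int, step: int):
    """Split the latent into num_gpus contiguous chunks along this step's axis.

    Returns the tensor axis used (1, 2, or 3, skipping the channel axis) and the shards.
    """
    axis = 1 + partition_axis(step)
    return axis, np.array_split(latent, num_gpus, axis=axis)


def fake_denoise(shard: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for one denoising step that sees only its local sub-volume."""
    return shard * 0.5


def run(latent: np.ndarray, num_gpus: int, num_steps: int) -> np.ndarray:
    for step in range(num_steps):
        axis, shards = shard_latent(latent, num_gpus, step)
        # Each "GPU" works independently on its shard (simulated sequentially here).
        shards = [fake_denoise(s) for s in shards]
        # Reassemble the latent before the next step re-partitions it along a
        # different axis; on real hardware this is where latent-sized data moves.
        latent = np.concatenate(shards, axis=axis)
        print(f"step {step}: split along {AXIS_NAMES[axis - 1]}, "
              f"shard shapes {[s.shape for s in shards]}")
    return latent


if __name__ == "__main__":
    out = run(np.ones((4, 8, 16, 16)), num_gpus=2, num_steps=3)
```

With two simulated GPUs and three steps, the latent is split first along the temporal axis, then height, then width. Between steps only the latent itself is exchanged, not per-layer activations, which mirrors the abstract's point about partitioning in the compact latent space. Rotating the axis also means a position on a shard boundary at one step lies inside a shard at the next; this is one plausible reading of why rotation helps, and the sketch performs no overlap or halo exchange that a real implementation may require.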

