Minute-Long Videos with Dual Parallelisms
Zeqing Wang, Bowen Zheng, Xingyi Yang, Zhenxiong Tan, Yuecong Xu, Xinchao Wang

TL;DR
This paper introduces DualParal, a distributed inference method for diffusion transformer video models that parallelizes across both temporal frames and model layers, significantly reducing latency and memory costs for long video generation.
Contribution
We propose a novel distributed inference strategy that enables efficient, asynchronous, and scalable long video generation by parallelizing across frames and layers with synchronization and feature caching.
Findings
Achieves 6.54× lower latency on 8×RTX 4090 GPUs.
Reduces memory cost by 1.48× while maintaining quality.
Enables generation of 1025-frame videos with artifact-free quality.
Abstract
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsImage and Video Quality Assessment · Generative Adversarial Networks and Image Synthesis · Video Coding and Compression Technologies
MethodsAttention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Layer Normalization · Byte Pair Encoding
