Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index
Chao Yuan, Pan Li

TL;DR
This paper introduces system-level optimizations and a novel positional encoding to accelerate diffusion transformer-based video generation, achieving near real-time inference and reducing latency without sacrificing quality.
Contribution
The paper proposes a sequence-parallel causal rotary position embedding and system optimizations for diffusion transformer video models, enabling faster, real-time capable video synthesis.
Findings
Achieved 1.58x speedup in 480P video generation
Reduced first-frame latency to sub-second levels
Maintained comparable video quality with the optimized system
Abstract
Diffusion Transformer (DiT)-based video generation models face bottlenecks in long video synthesis and real-time inference that stem from full spatiotemporal attention: memory consumption grows as O(N^2) in the sequence length N, and first-frame latency is high. To address these issues, we implement system-level inference optimizations for a causal autoregressive video generation pipeline. We adapt the Self-Forcing causal autoregressive framework to sequence-parallel inference and implement a sequence-parallel variant of the causal rotary position embedding, which we refer to as Causal-RoPE SP. This adaptation enables localized computation and reduces cross-rank communication in sequence-parallel execution. In addition, computation and communication pipelines are optimized through operator fusion and RoPE precomputation. Experiments…
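The abstract does not spell out how Causal-RoPE SP is computed, so the following is only a minimal sketch of the general idea suggested by the title: each sequence-parallel rank derives the 3D (time, height, width) position of its own tokens from a global time index, so no positions need to be exchanged between ranks. The function names (`local_positions`, `rope_angles`, `apply_rope_3d`), the contiguous token split across ranks, the `frame_offset` parameter, and the per-axis channel split are all illustrative assumptions, not details taken from the paper.

```python
import math

def local_positions(rank, world_size, frame_offset, num_frames, height, width):
    """(t, h, w) for every token owned by `rank`; t is a global time index.

    Assumption: tokens are flattened in (frame, row, column) order and
    split into equal contiguous slices, one slice per rank.
    """
    tokens = num_frames * height * width
    assert tokens % world_size == 0, "tokens must divide evenly across ranks"
    per_rank = tokens // world_size
    start = rank * per_rank
    positions = []
    for flat in range(start, start + per_rank):
        t_local, rem = divmod(flat, height * width)
        h, w = divmod(rem, width)
        # Adding the offset of the current chunk makes the time index global.
        positions.append((frame_offset + t_local, h, w))
    return positions

def rope_angles(pos, dim, base=10000.0):
    """Standard RoPE angles for one axis: pos * base^(-2i/dim)."""
    return [pos * base ** (-2 * i / dim) for i in range(dim // 2)]

def rotate(vec, angles):
    """Rotate consecutive channel pairs of `vec` by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend((x * c - y * s, x * s + y * c))
    return out

def apply_rope_3d(vec, pos, dims):
    """Apply RoPE per axis; dims = (d_t, d_h, d_w), each even, summing to len(vec)."""
    assert sum(dims) == len(vec) and all(d % 2 == 0 for d in dims)
    out, start = [], 0
    for p, d in zip(pos, dims):
        out.extend(rotate(vec[start:start + d], rope_angles(p, d)))
        start += d
    return out

# Usage: 4 ranks share a 2-frame chunk of 2x4 tokens that starts at global frame 3.
for rank in range(4):
    first = local_positions(rank, 4, frame_offset=3, num_frames=2, height=2, width=4)[0]
    print(rank, first)
```

Because every rank can compute these positions from its rank id and the chunk offset alone, the angle tables can also be built once ahead of time, which is one plausible reading of the RoPE precomputation mentioned in the abstract.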
Taxonomy
Topics: Human Pose and Action Recognition · Video Coding and Compression Technologies · Human Motion and Animation
