Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index

Chao Yuan; Pan Li

arXiv:2603.06664·cs.CV·March 10, 2026

Open Access

TL;DR

This paper introduces system-level inference optimizations and a sequence-parallel variant of causal rotary position embedding (Causal-RoPE SP) to accelerate diffusion transformer-based video generation, reaching near real-time inference with sub-second first-frame latency while keeping video quality comparable.

Contribution

The paper proposes a sequence-parallel causal rotary position embedding (Causal-RoPE SP) for diffusion transformer video models, together with system optimizations (operator fusion and RoPE precomputation), enabling faster, near real-time video synthesis.

Findings

01. Achieved a 1.58x speedup in 480P video generation

02. Reduced first-frame latency to sub-second levels

03. Maintained comparable video quality with the optimized system

Abstract

Diffusion Transformer (DiT)-based video generation models inherently suffer from bottlenecks in long video synthesis and real-time inference, which can be attributed to the use of full spatiotemporal attention. Specifically, this mechanism leads to explosive O(N^2) memory consumption and high first-frame latency. To address these issues, we implement system-level inference optimizations for a causal autoregressive video generation pipeline. We adapt the Self-Forcing causal autoregressive framework to sequence-parallel inference and implement a sequence-parallel variant of the causal rotary position embedding, which we refer to as Causal-RoPE SP. This adaptation enables localized computation and reduces cross-rank communication in sequence-parallel execution. In addition, computation and communication pipelines are optimized through operator fusion and RoPE precomputation. Experiments…
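The idea of localized positional computation under sequence parallelism can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a one-dimensional temporal axis only (the paper's encoding is 3D), contiguous sharding of each chunk across ranks, and the standard RoPE frequency schedule with base 10000. The names `rope_angles`, `shard_positions`, `apply_rope`, and `TIME_OFFSET` are hypothetical. The point it shows: if every rank derives its rotation angles from a global time index (offset of previously generated tokens plus its own shard start), the per-rank results match a single-device computation without any cross-rank exchange of positions.

```python
import math

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles per position: angle[p][i] = p * base^(-2i/dim)."""
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    return [[p * f for f in inv_freq] for p in positions]

def shard_positions(rank, world_size, chunk_len, time_offset):
    """Global time indices of the contiguous shard owned by `rank`.

    `time_offset` counts tokens generated in earlier autoregressive chunks
    (an assumption of this sketch).
    """
    assert chunk_len % world_size == 0, "chunk must split evenly across ranks"
    local_len = chunk_len // world_size
    start = time_offset + rank * local_len
    return list(range(start, start + local_len))

def apply_rope(vec, angles):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out

# Hypothetical toy sizes, chosen only for illustration.
DIM, WORLD_SIZE, CHUNK_LEN, TIME_OFFSET = 4, 2, 8, 16

# Reference: one device computes angles for the whole chunk.
full = rope_angles(range(TIME_OFFSET, TIME_OFFSET + CHUNK_LEN), DIM)

# Sequence parallel: each rank computes angles for its own shard only.
sharded = []
for rank in range(WORLD_SIZE):
    local_pos = shard_positions(rank, WORLD_SIZE, CHUNK_LEN, TIME_OFFSET)
    sharded.extend(rope_angles(local_pos, DIM))

# Rotating a query vector with the first locally computed angle set.
rotated = apply_rope([1.0, 0.0, 0.0, 1.0], sharded[0])
```

Because the angles depend only on the global index, they could also be computed once per chunk and cached, which is one plausible reading of the RoPE precomputation mentioned in the abstract.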

Taxonomy

Topics: Human Pose and Action Recognition · Video Coding and Compression Technologies · Human Motion and Animation