FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
Jian Tang, Jiawei Fan, Qingbin Liu, Zheng Wei

TL;DR
FIS-DiT introduces a training-free, frame interleaved sparsity method to significantly accelerate video diffusion transformer inference, especially in few-step regimes, without substantial quality loss.
Contribution
The paper proposes a novel, training-free framework that exploits frame-wise sparsity and structural consistency to enhance inference speed in video diffusion transformers.
Findings
Achieves a 2.11–2.41× speedup on benchmark datasets.
Incurs negligible quality degradation across key metrics.
Provides a scalable approach for real-time high-definition video generation.
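To put the reported speedup range in terms of latency, the short sketch below converts each factor into the fraction of baseline latency that remains. It assumes the speedup is the ratio of baseline latency to accelerated latency; the summary does not state whether the figure is per-step or end-to-end, and the helper name `latency_fraction` is purely illustrative.

```python
def latency_fraction(speedup: float) -> float:
    """Fraction of baseline latency remaining after a given speedup factor.

    Assumes speedup = baseline_latency / accelerated_latency.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Speedup range quoted in the findings above.
reported_speedups = (2.11, 2.41)

for s in reported_speedups:
    frac = latency_fraction(s)
    print(f"{s:.2f}x speedup: {frac:.1%} of baseline latency, {1 - frac:.1%} saved")
```

Under that assumption, the reported range corresponds to running at roughly 47% down to 41% of the baseline latency.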
Abstract
While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that…
