Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

Yuxi Liu; Yipeng Hu; Zekun Zhang; Kunze Jiang; Kun Yuan

arXiv:2601.11641·cs.CV·February 4, 2026

Open Access

TL;DR

MOD-DiT is a sampling-free dynamic sparse attention method for video diffusion transformers. It models evolving attention patterns without the computationally expensive sampling that other dynamic sparse methods rely on, and is reported to speed up generation while maintaining or improving quality.

Contribution

The paper proposes a mixture-of-distribution framework for dynamic sparse attention that removes the sampling step and improves efficiency in video diffusion transformers.

Findings

01. Achieves faster video generation with maintained or improved quality.

02. Demonstrates consistent acceleration across multiple benchmarks.

03. Validates effectiveness over traditional sparse attention methods.

Abstract

While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers to practical deployment. Although sparse attention methods attempt to address this challenge, existing approaches either rely on oversimplified static patterns or require computationally expensive sampling operations to achieve dynamic sparsity, resulting in inaccurate pattern predictions and degraded generation quality. To overcome these limitations, we propose Mixture-Of-Distribution DiT (MOD-DiT), a novel sampling-free dynamic attention framework that accurately models evolving attention patterns through a two-stage process. First, MOD-DiT leverages prior information…
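
The quadratic cost named in the abstract can be made concrete with a small counting sketch. This is not MOD-DiT's method, which the truncated abstract does not describe here. It only compares how many query-key scores dense self-attention computes against a generic block-sparse pattern in which each query block attends to a fixed number of key blocks. The function names, the block size, and the sequence length are all hypothetical, chosen only for illustration.

```python
def dense_attention_pairs(seq_len: int) -> int:
    """Number of query-key scores computed by dense self-attention."""
    return seq_len * seq_len


def block_sparse_attention_pairs(seq_len: int, block: int, kept_blocks_per_row: int) -> int:
    """Scores computed when each query block attends to a fixed number of key blocks.

    Assumption: a generic block-sparse layout, not the pattern used by any
    specific paper.
    """
    if seq_len % block != 0:
        raise ValueError("seq_len must be a multiple of block")
    num_blocks = seq_len // block
    kept = min(kept_blocks_per_row, num_blocks)
    # Each kept (query block, key block) pair costs block * block scores.
    return num_blocks * kept * block * block


# Hypothetical sizes, for illustration only.
seq_len = 32_768
block = 128
kept_blocks_per_row = 32

dense = dense_attention_pairs(seq_len)
sparse = block_sparse_attention_pairs(seq_len, block, kept_blocks_per_row)
print(f"dense scores:  {dense}")
print(f"sparse scores: {sparse}")
print(f"reduction:     {dense / sparse:.1f}x")
```

Under these assumed sizes, keeping 32 of 256 key blocks per query block cuts the score count by a factor of 8. Doubling the sequence length quadruples the dense count, whereas the block-sparse count only doubles as long as the number of kept blocks per row stays fixed.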

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Image and Video Quality Assessment