VMonarch: Efficient Video Diffusion Transformers with Structured Attention

Cheng Liang; Haoxian Chen; Liang Hou; Qi Fan; Gangshan Wu; Xin Tao; Limin Wang

arXiv:2601.22275·cs.CV·February 2, 2026

Open Access

TL;DR

VMonarch introduces a structured attention mechanism based on Monarch matrices for Video Diffusion Transformers. It reduces attention FLOPs by 17.5 times and speeds up attention computation by over 5x on long videos, while matching or improving the generation quality of full attention.

Contribution

The paper proposes VMonarch, a novel structured attention method with Monarch matrices that achieves sub-quadratic complexity and improves efficiency over existing sparse attention techniques.

Findings

01. Reduces attention FLOPs by 17.5 times

02. Achieves over 5x speedup in attention computation for long videos

03. Maintains or improves generation quality compared to full attention

Abstract

The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by Monarch matrices, a class of structured matrices with flexible sparsity that enables sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a…
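To make the "sub-quadratic" claim concrete, the following is a minimal sketch of a generic Monarch matrix-vector product, assuming one common formulation from the Monarch literature: M = P · blockdiag(L) · P · blockdiag(R), where n = m·m and P is the reshape-and-transpose permutation. It is not the paper's spatio-temporal factorization, and it does not include the alternating minimization step. The function names `monarch_matvec`, `block_diag`, and `monarch_dense` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def monarch_matvec(L, R, x):
    """Apply a Monarch-structured matrix to x without forming it densely.

    L, R: arrays of shape (m, m, m), each holding m blocks of size m x m.
    x:    vector of length n = m * m.
    Cost is about 2 * m**3 = 2 * n**1.5 multiply-adds instead of n**2.
    """
    m = R.shape[0]
    X = x.reshape(m, m)
    # Block-diagonal R: block i acts on row i of X.
    Y = np.einsum("ijk,ik->ij", R, X)
    # Permutation P is a transpose of the (m, m) layout.
    # Block-diagonal L: block j acts on row j of the transposed layout.
    Z = np.einsum("ijk,ik->ij", L, Y.T)
    # Second permutation: transpose back, then flatten.
    return Z.T.reshape(-1)


def block_diag(blocks):
    """Assemble a dense block-diagonal matrix (for checking only)."""
    m, b, _ = blocks.shape
    out = np.zeros((m * b, m * b))
    for i in range(m):
        out[i * b:(i + 1) * b, i * b:(i + 1) * b] = blocks[i]
    return out


def monarch_dense(L, R):
    """Materialize the same operator as a dense n x n matrix (for checking only)."""
    m = R.shape[0]
    n = m * m
    perm = np.arange(n).reshape(m, m).T.reshape(-1)
    P = np.eye(n)[perm]  # permutation matrix for the layout transpose
    return P @ block_diag(L) @ P @ block_diag(R)


rng = np.random.default_rng(0)
m = 4  # illustrative size; n = 16
L, R = rng.standard_normal((2, m, m, m))
x = rng.standard_normal(m * m)
print(np.allclose(monarch_dense(L, R) @ x, monarch_matvec(L, R, x)))
```

The structured path stores 2·m³ parameters and never builds the n × n matrix, which is where the sub-quadratic cost comes from in this generic setting. How VMonarch maps the two block-diagonal factors onto intra-frame and inter-frame structure is described in the paper itself.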

Taxonomy

Topics: Image Enhancement Techniques · Advanced Memory and Neural Computing · Visual Attention and Saliency Detection