FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

Minh Khoa Le; Kien Do; Duc Thanh Nguyen; Truyen Tran

arXiv:2603.09721·cs.CV·April 21, 2026

TL;DR

FrameDiT introduces Matrix Attention, a novel frame-level attention mechanism for diffusion models that improves global spatio-temporal modeling in video generation, achieving state-of-the-art results.

Contribution

The paper proposes Matrix Attention, a new attention mechanism that processes entire frames as matrices, balancing efficiency and global temporal modeling in diffusion-based video generation.
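The idea of attending across whole frames rather than individual tokens can be illustrated with a minimal NumPy sketch. This is not the paper's formulation: the per-frame right-multiplication projections (`w_q`, `w_k`, `w_v`), the Frobenius inner product used as the frame-to-frame score, and the `sqrt(N * D)` scaling are all assumptions made here for illustration, and the function name `frame_level_attention` is hypothetical.

```python
import numpy as np

def frame_level_attention(x, w_q, w_k, w_v):
    """Illustrative frame-level attention (assumed form, not the paper's).

    x: array of shape (T, N, D), i.e. T frames, each an N x D matrix of tokens.
    w_q, w_k, w_v: (D, D) projection matrices (hypothetical parameters).
    Returns the attended frames (T, N, D) and the frame-to-frame weights (T, T).
    """
    t, n, d = x.shape
    # Each frame stays a matrix: Q_t, K_t, V_t all have shape (N, D).
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v
    # One score per pair of frames: Frobenius inner product <Q_t, K_u>.
    scores = np.einsum("tnd,und->tu", q, k) / np.sqrt(n * d)
    # Numerically stable softmax over the source-frame axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output frame is a weighted sum of whole value matrices.
    out = np.einsum("tu,und->tnd", weights, v)
    return out, weights

rng = np.random.default_rng(0)
T, N, D = 4, 6, 8
x = rng.standard_normal((T, N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
out, weights = frame_level_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

In this sketch the attention map is only T x T, so every frame sees every other frame while the number of scores is independent of the per-frame token count N.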

Findings

01. FrameDiT-H achieves state-of-the-art performance on multiple benchmarks.

02. Matrix Attention effectively captures global temporal dynamics.

03. The approach maintains efficiency comparable to local attention methods.

Abstract

High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens, which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on…
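The trade-off described in the abstract can be made concrete with a rough count of multiply-adds needed to compute attention scores. The sketch below is a back-of-the-envelope estimate, not a measurement from the paper: the sizes `T`, `N`, `D` are arbitrary example values, the function name `score_costs` is hypothetical, and the frame-level entry assumes one score per frame pair computed as an inner product over all N x D entries.

```python
def score_costs(t, n, d):
    """Approximate multiply-adds for attention scores only (QK^T step).

    t: frames, n: tokens per frame, d: channels per token.
    All formulas are standard counts; the frame-level one rests on the
    assumption of a single score per pair of frames.
    """
    return {
        # Every token attends to every other token in the video.
        "full_3d": (t * n) ** 2 * d,
        # Tokens attend within their own frame.
        "factorized_spatial": t * n * n * d,
        # Each spatial location attends along time only.
        "factorized_temporal": n * t * t * d,
        # Assumed: one score per frame pair over N x D entries.
        "frame_level_temporal": t * t * n * d,
    }

costs = score_costs(t=16, n=256, d=64)
for name, value in costs.items():
    print(f"{name:>22}: {value:,}")
```

Under these assumptions the frame-level temporal cost matches the factorized temporal cost, while full 3D attention is larger than the factorized spatial term by a factor of T.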
