MonarchRT: Efficient Attention for Real-Time Video Generation

Krish Agarwal; Zhuoming Chen; Cheng Luo; Yongqi Chen; Haizhong Zheng; Xun Huang; Atri Rudra; Beidi Chen

arXiv:2602.12271 · cs.CV · February 13, 2026

TL;DR

MonarchRT introduces a structured attention mechanism for video diffusion models that cuts the cost of attention enough to enable real-time video generation at 16 FPS on a single RTX 5090, with no reported loss in quality.

Contribution

We propose Monarch-RT, a novel structured attention parameterization that captures complex spatiotemporal dependencies efficiently, surpassing prior sparse attention methods in real-time video diffusion.

Findings

01

Achieves up to 95% attention sparsity with no quality loss.

02

Delivers a 1.4-11.8x speedup over existing FlashAttention kernels.

03

Enables real-time video generation at 16 FPS on a single RTX 5090.

Abstract

Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices.…
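
For readers unfamiliar with Monarch matrices, the sketch below illustrates the general structure as defined in earlier work on structured matrices: a product of two block-diagonal factors interleaved with a fixed reshape-and-transpose permutation. It is not the paper's attention parameterization or kernel, which this listing does not describe. The function names (`monarch_matvec`, `monarch_dense`), the square case n = b * b, and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np


def monarch_matvec(L, R, x):
    """Apply M = P @ blockdiag(L) @ P.T @ blockdiag(R) to x without forming M.

    L and R have shape (b, b, b): b diagonal blocks, each b x b.
    x has length n = b * b. P is the reshape-and-transpose permutation.
    (Hypothetical helper for illustration only.)
    """
    b = L.shape[0]
    X = x.reshape(b, b)                    # row k holds the k-th chunk of x
    U = np.einsum("kij,kj->ki", R, X)      # block-diagonal R: one small matvec per chunk
    V = np.einsum("kij,kj->ki", L, U.T)    # permute (transpose), then block-diagonal L
    return V.T.reshape(-1)                 # permute back and flatten


def monarch_dense(L, R):
    """Materialize the same matrix densely, used here only to check the fast path."""
    b = L.shape[0]
    n = b * b

    def block_diag(blocks):
        out = np.zeros((n, n))
        for k in range(b):
            out[k * b:(k + 1) * b, k * b:(k + 1) * b] = blocks[k]
        return out

    # Permutation that maps a flattened (b, b) array to its flattened transpose.
    idx = np.arange(n).reshape(b, b).T.reshape(-1)
    P = np.eye(n)[idx]
    return P @ block_diag(L) @ P.T @ block_diag(R)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b = 4
    L = rng.standard_normal((b, b, b))
    R = rng.standard_normal((b, b, b))
    x = rng.standard_normal(b * b)
    gap = np.max(np.abs(monarch_matvec(L, R, x) - monarch_dense(L, R) @ x))
    print("max abs difference between fast and dense paths:", gap)
```

In this square case the two factors hold 2 * b^3 = 2 * n^1.5 parameters and the product costs on the order of n^1.5 multiply-adds, compared with n^2 for a dense n x n matrix. That gap is the general reason a Monarch-style factorization is attractive as a replacement for a quadratic-cost operator.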

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Human Motion and Animation