MonarchRT: Efficient Attention for Real-Time Video Generation
Krish Agarwal, Zhuoming Chen, Cheng Luo, Yongqi Chen, Haizhong Zheng, Xun Huang, Atri Rudra, Beidi Chen

TL;DR
MonarchRT introduces a structured attention mechanism for video diffusion models that significantly reduces the cost of attention, enabling real-time video generation at 16 FPS on a single RTX 5090 without quality loss.
Contribution
We propose Monarch-RT, a novel structured attention parameterization that captures complex spatiotemporal dependencies efficiently, surpassing prior sparse attention methods in real-time video diffusion.
Findings
Achieves up to 95% attention sparsity with no quality loss.
Delivers 1.4-11.8x speedups over existing FlashAttention kernels.
Enables real-time video generation at 16 FPS on a single RTX 5090.
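As a rough reading aid for the figures above, the sketch below works out the per-frame time budget implied by 16 FPS and the fraction of attention entries that remain at 95% sparsity. The helper names are hypothetical. The speedup bound assumes attention cost scales linearly with the number of score entries computed, which is a simplification the source does not state.

```python
# Back-of-envelope arithmetic for the reported figures (illustrative only).

def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time available per frame at a given frame rate."""
    return 1000.0 / fps


def kept_fraction(sparsity: float) -> float:
    """Fraction of attention score entries still computed at a given sparsity."""
    return 1.0 - sparsity


def ideal_speedup(sparsity: float) -> float:
    """Upper bound on speedup, ASSUMING cost is proportional to entries computed."""
    return 1.0 / kept_fraction(sparsity)


if __name__ == "__main__":
    print(f"budget at 16 FPS: {frame_budget_ms(16):.1f} ms per frame")
    print(f"entries kept at 95% sparsity: {kept_fraction(0.95):.0%}")
    print(f"idealized speedup bound: {ideal_speedup(0.95):.1f}x")
```

Under that proportional-cost assumption, 95% sparsity caps the speedup at roughly 20x, so the reported 1.4-11.8x range falls below this idealized ceiling.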
Abstract
Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices.…
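To make the term "Monarch matrices" concrete, here is a minimal pure-Python sketch of a generic Monarch matrix-vector product: two block-diagonal factors interleaved with a reshape-and-transpose permutation. It is not the paper's attention kernel. The function names, the square block layout (n = m * m), and the random test data are all assumptions chosen for illustration.

```python
import random


def monarch_matvec(L_blocks, R_blocks, x):
    """Apply M = P L P R to x (illustrative layout, not the paper's kernel).

    L_blocks and R_blocks each hold m dense blocks of size m x m, so L and R
    are block-diagonal matrices of size n x n with n = m * m. P is the
    permutation that transposes an (m, m) grid of entries.
    """
    m = len(R_blocks)
    assert len(x) == m * m
    # Step 1: R mixes entries inside each contiguous chunk of length m.
    y = [[sum(R_blocks[i][r][c] * x[i * m + c] for c in range(m))
          for r in range(m)] for i in range(m)]
    # Step 2: transpose the grid so entries from different chunks meet.
    yt = [[y[i][r] for i in range(m)] for r in range(m)]
    # Step 3: L mixes entries inside each new chunk.
    z = [[sum(L_blocks[r][a][i] * yt[r][i] for i in range(m))
          for a in range(m)] for r in range(m)]
    # Step 4: transpose back and flatten to a length-n vector.
    return [z[r][a] for a in range(m) for r in range(m)]


def monarch_dense(L_blocks, R_blocks):
    """Materialize the same operator as a dense n x n matrix, for checking."""
    m = len(R_blocks)
    n = m * m
    M = [[0.0] * n for _ in range(n)]
    for a in range(m):
        for r in range(m):
            for i in range(m):
                for c in range(m):
                    M[a * m + r][i * m + c] = L_blocks[r][a][i] * R_blocks[i][r][c]
    return M


def random_blocks(m, rng):
    """m random dense blocks of size m x m (hypothetical test data)."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(m)]
            for _ in range(m)]


if __name__ == "__main__":
    rng = random.Random(0)
    m = 4
    L_blocks, R_blocks = random_blocks(m, rng), random_blocks(m, rng)
    x = [rng.uniform(-1.0, 1.0) for _ in range(m * m)]
    fast = monarch_matvec(L_blocks, R_blocks, x)
    dense = monarch_dense(L_blocks, R_blocks)
    slow = [sum(row[j] * x[j] for j in range(m * m)) for row in dense]
    print("max abs difference:", max(abs(p - q) for p, q in zip(fast, slow)))
    print("multiplies: structured", 2 * m ** 3, "vs dense", m ** 4)
```

Every entry of the resulting matrix can be nonzero, yet the structured product needs 2 * m^3 multiplications (about 2 * n^1.5) instead of n^2. This is one way a structured factorization can express dense mixing that a purely sparse mask cannot.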
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Human Motion and Animation
