Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation
Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, and Qingyi Gu

TL;DR
This paper introduces Rectified SpaAttn, a sparse attention mechanism for video diffusion transformers that corrects the attention-allocation biases of existing sparse methods, reducing computational cost while maintaining generation quality.
Contribution
We propose Rectified SpaAttn, which corrects biases in existing sparse attention methods, and develop specific rectification techniques to better align sparse and full attention maps.
Findings
Achieves up to 3.33x speedup on HunyuanVideo
Maintains high video generation quality
Addresses systematic biases in attention allocation
Abstract
Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to…
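The two biases described in the abstract can be illustrated with a toy example. The sketch below is not the authors' method; it only compares a full softmax with a softmax restricted to a set of "critical" tokens. The scores and the choice of critical indices are made-up values for illustration, and `sparse_softmax` is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_softmax(scores, critical):
    """Softmax computed only over the critical indices; all other
    positions receive zero weight (hypothetical helper)."""
    kept = softmax([scores[i] for i in critical])
    out = [0.0] * len(scores)
    for i, w in zip(critical, kept):
        out[i] = w
    return out

# Made-up attention scores for one query over six key tokens.
scores = [2.0, 1.0, 0.5, 0.0, -0.5, -1.0]
critical = [0, 1]  # assumed set of critical tokens

full = softmax(scores)
sparse = sparse_softmax(scores, critical)

# Bias (1): critical tokens get more weight than under full attention.
amplification = [sparse[i] - full[i] for i in critical]

# Bias (2): weight that full attention gave to non-critical tokens is lost.
lost_mass = sum(full[i] for i in range(len(scores)) if i not in critical)

print("full:  ", [round(w, 4) for w in full])
print("sparse:", [round(w, 4) for w in sparse])
print("amplification on critical tokens:", [round(a, 4) for a in amplification])
print("attention mass dropped from non-critical tokens:", round(lost_mass, 4))
```

In this toy setting the sparse weights still sum to one, so the weight removed from non-critical tokens reappears as extra weight on the critical ones. How Rectified SpaAttn corrects each bias is described in the paper itself.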
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Visual Attention and Saliency Detection
