Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation

Qirui Li; Guangcong Zheng; Qi Zhao; Jie Li; Bin Dong; Yiwu Yao; Xi Li

arXiv:2508.12969·cs.CV·August 19, 2025

TL;DR

This paper introduces Compact Attention, a hardware-aware framework that exploits structured spatio-temporal sparsity in video transformers to accelerate attention computation significantly while maintaining visual quality.

Contribution

It presents a novel adaptive, hardware-aware sparse attention method that dynamically models diverse spatio-temporal patterns in video data for efficient long-form video generation.

Findings

1. Achieves 1.6× to 2.5× faster attention computation on a single GPU.

2. Maintains visual quality comparable to full-attention models.

3. Effectively exploits structured sparsity in video transformers.

Abstract

The computational demands of self-attention mechanisms pose a critical challenge for transformer-based video generation, particularly in synthesizing ultra-long sequences. Current approaches, such as factorized attention and fixed sparse patterns, fail to fully exploit the inherent spatio-temporal redundancies in video data. Through systematic analysis of video diffusion transformers (DiT), we uncover a key insight: attention matrices exhibit structured yet heterogeneous sparsity patterns, where specialized heads dynamically attend to distinct spatio-temporal regions (e.g., local, cross-shaped, or global patterns). Existing sparse attention methods either impose rigid constraints or introduce significant overhead, limiting their effectiveness. To address this, we propose Compact Attention, a hardware-aware acceleration framework featuring three innovations: 1) Adaptive…
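
To make the "local pattern" mentioned in the abstract concrete, here is a minimal sketch of a local spatio-temporal window mask over a (T, H, W) token grid, together with the fraction of query-key pairs it skips. This is an illustration under stated assumptions, not the paper's implementation: the function names `local_window_mask` and `sparsity` and the half-window parameters `wt`, `wh`, `ww` are hypothetical, and a dense Python list is used only for clarity, whereas a practical kernel would work on tiles.

```python
from itertools import product


def local_window_mask(T, H, W, wt, wh, ww):
    """Return mask[q][k], True when key k lies inside the local window of query q.

    Tokens are laid out on a (T, H, W) grid in row-major order.
    wt, wh, ww are hypothetical half-window sizes along time, height, width.
    """
    coords = list(product(range(T), range(H), range(W)))
    mask = []
    for tq, hq, wq in coords:
        row = [
            abs(tq - tk) <= wt and abs(hq - hk) <= wh and abs(wq - wk) <= ww
            for tk, hk, wk in coords
        ]
        mask.append(row)
    return mask


def sparsity(mask):
    """Fraction of query-key pairs that the mask skips."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


if __name__ == "__main__":
    # 4 frames of 4x4 tokens; each query sees +/-1 frame and a 3x3 spatial window.
    m = local_window_mask(T=4, H=4, W=4, wt=1, wh=1, ww=1)
    print(f"tokens: {len(m)}, sparsity: {sparsity(m):.3f}")
```

A cross-shaped or global pattern could be sketched the same way by changing the predicate inside the list comprehension; which pattern each head actually needs is what the paper's analysis is about.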

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The authors demonstrate significant acceleration (up to 3×) with negligible quality loss, validated on strong baselines and realistic benchmarks.
2. The method is training-free and hardware-aware, making it practical for real-world applications and compatible with existing transformer architectures.
3. The study of the effectiveness of "delaying sparse attention" is good. Although the effect has been observed in previous studies, none of them plotted this trend.

Weaknesses

1. The offline mask search, though effective, could be computationally heavy to deploy across diverse configurations. The authors should add experiments showing how the final quality changes with the computation spent on this calibration process.
2. On highly dynamic or non-redundant video scenes, is the sparsity pattern still highly similar to the sparse attention pattern found by the offline search? In Figure 3(b), some similarity scores appear to be as low as 0.7 (based on co

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The proposed approach is novel, particularly the idea of offline precomputation of sparse masks and the use of dual attention windows to represent attention masks. The kernel implementation is tailored for their method.

Weaknesses

The paper’s presentation is rather unclear in several places. For example:
- The description of the greedy algorithm suggests a progressive contraction, but the pseudocode shows progressive expansion.
- In Figure 4, there is a “Flag” term that is never defined in the text.
- Sections such as Reuse Masks across Denoising Steps lack sufficient details.

The experimental evaluation is limited. For instance, in Table 2, on Wan 2.1, the STA method achieves twice the sparsity of Compact Attention, w

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. This work shows that the proposed method achieves better performance than several baselines.
2. This work shows that the method can achieve up to 3× acceleration.

Weaknesses

1. The novelty of focusing on spatial and temporal redundancy is limited, as it is similar to SVG [1]. SVG also partitions tokens into local tiles and employs cross-frame attention masks.
2. The offline search for the optimal static pruning strategy is empirically unconvincing. The sparse patterns are always related to the input prompt, denoising step, layer depth, and seed. The authors also show that the similarity is over 0.8, which means the static strategy may not be the optimal one and may lead to

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Video Coding and Compression Technologies