TL;DR
This paper introduces SVOO, a training-free sparse attention method for video generation that leverages layer-wise attention sparsity and bidirectional co-clustering to improve efficiency without sacrificing quality.
Contribution
SVOO is a novel framework that uses offline layer profiling and online co-clustering for sparse attention, addressing layer heterogeneity and query-key coupling issues.
Findings
Achieves up to a 1.93× speedup in video generation.
Maintains PSNR of up to 29 dB on Wan2.1.
Outperforms state-of-the-art sparse attention methods.
Abstract
Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, motivating sparse attention techniques for improving efficiency. However, existing training-free sparse attention methods for video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight: attention sparsity is an intrinsic layer-wise property, with only minor variation across different inputs. Motivated by this observation, we propose SVOO, a training-free sparse attention framework for fast video generation via offline layer-wise sparsity profiling and online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise…
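The observation that attention sparsity is a layer-wise property with little variation across inputs suggests that a per-layer budget can be measured once on a few calibration inputs and then reused. The sketch below is one possible way to do such profiling, not SVOO's actual procedure, which the truncated abstract does not spell out. The function names `kept_fraction` and `profile_layers`, the cumulative-mass threshold `mass`, and the use of per-key (rather than per-block) statistics are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def kept_fraction(q, k, mass=0.95):
    """Average fraction of keys a query needs to cover `mass` of its attention.

    q: (num_queries, dim), k: (num_keys, dim). Hypothetical helper, not from the paper.
    """
    probs = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Sort each row in descending order and accumulate probability mass.
    cum = np.cumsum(np.sort(probs, axis=-1)[:, ::-1], axis=-1)
    # Keys needed = number of prefix sums still below the target, plus one.
    needed = np.minimum((cum < mass).sum(axis=-1) + 1, k.shape[0])
    return float(needed.mean() / k.shape[0])


def profile_layers(layer_samples, mass=0.95):
    """Offline profiling sketch: one density estimate per layer.

    layer_samples: list (over layers) of lists of (q, k) calibration pairs.
    Returns a list with the mean kept fraction for each layer.
    """
    return [
        float(np.mean([kept_fraction(q, k, mass) for q, k in samples]))
        for samples in layer_samples
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "layers": one with diffuse attention, one with sharper attention.
    diffuse = [(rng.normal(size=(64, 32)), rng.normal(size=(64, 32))) for _ in range(4)]
    sharp = [(4 * rng.normal(size=(64, 32)), 4 * rng.normal(size=(64, 32))) for _ in range(4)]
    for i, frac in enumerate(profile_layers([diffuse, sharp])):
        print(f"layer {i}: keep about {frac:.2%} of keys")
```

A layer whose estimate is low can be pruned aggressively, while a layer whose estimate is near 1.0 is better left close to dense. If the estimates vary little across calibration inputs, as the abstract reports, a small calibration set should be enough.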
