SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang

TL;DR
SALAD introduces a lightweight linear attention branch with a static-dynamic scaling strategy to achieve up to 90% sparsity and over 1.5x speedup in video diffusion transformers without sacrificing quality.
Contribution
The paper presents a novel parallel linear attention method with a scaling strategy, enabling high sparsity and efficient finetuning in video diffusion transformers.
Findings
Achieves up to 90% sparsity in attention mechanisms.
Provides 1.52-2.03x inference speedup across models.
Requires only 2,000 video samples and 30 GPU hours for finetuning.
Abstract
Diffusion Transformers have demonstrated remarkable performance in video generation. However, their long input sequences incur substantial latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free approaches are limited to moderate sparsity and thus yield only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. Leveraging a Multi-level Static-Dynamic Scaling Strategy to balance the two branches, our method attains up to 90% sparsity and 1.52-2.03x inference speedup across different models and sequence lengths, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
