Fast Video Generation with Sliding Tile Attention
Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, Hao Zhang

TL;DR
This paper introduces sliding tile attention (STA), a hardware-efficient local attention mechanism that significantly accelerates 3D video diffusion models without quality loss, enabling faster video generation.
Contribution
The paper proposes STA, a novel local attention method with hardware-aware design, achieving substantial speedups in video diffusion models without retraining.
Findings
STA accelerates attention by up to 17x over FlashAttention-2.
End-to-end video generation latency is reduced from 945s to 685s without quality loss.
Code released publicly for reproducibility.
Abstract
Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention…
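To make the tile-by-tile idea concrete, the following is a minimal 1D sketch, not the authors' kernel. It compares a token-wise sliding-window mask with a tile-wise one and counts how many tile-sized blocks of the attention map are fully dense, fully empty, or partially filled ("mixed"). The function names, the 1D simplification, and the sizes (16 tokens, tiles of 4) are assumptions chosen for illustration; the paper works with 3D windows over video latents.

```python
import numpy as np

def token_window_mask(n, radius):
    # Token-wise sliding window: query i attends to key j when |i - j| <= radius.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= radius

def tile_window_mask(n, tile, tile_radius):
    # Tile-wise window (illustrative): every query in a tile attends to the
    # same set of key tiles, those within tile_radius of the query's tile.
    tiles = np.arange(n) // tile
    return np.abs(tiles[:, None] - tiles[None, :]) <= tile_radius

def count_block_kinds(mask, tile):
    # Split the n x n mask into tile x tile blocks and classify each block.
    nb = mask.shape[0] // tile
    blocks = mask.reshape(nb, tile, nb, tile).transpose(0, 2, 1, 3)
    filled = blocks.sum(axis=(2, 3))
    dense = int((filled == tile * tile).sum())
    empty = int((filled == 0).sum())
    mixed = nb * nb - dense - empty
    return dense, empty, mixed

n, tile = 16, 4  # hypothetical sizes, chosen only to keep the example small
swa_counts = count_block_kinds(token_window_mask(n, radius=4), tile)
sta_counts = count_block_kinds(tile_window_mask(n, tile, tile_radius=1), tile)

print("token-wise (dense, empty, mixed):", swa_counts)
print("tile-wise  (dense, empty, mixed):", sta_counts)
```

In this toy setting the tile-wise mask yields only fully dense or fully empty blocks, so a block-oriented attention kernel could skip the empty blocks outright and process the dense ones without per-element masking. The token-wise mask leaves partially filled blocks that would still need masking inside the kernel.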
Code & Models
- 🤗 ai-forever/Wan2.1-T2V-14B-NABLA-0.7 · model · 68 dl · ♡ 5
- 🤗 FastVideo/FastWan2.1-T2V-1.3B-Diffusers · model · 200 dl · ♡ 20
- 🤗 FastVideo/FastWan2.1-T2V-14B-Diffusers · model · 54 dl · ♡ 18
- 🤗 ai-forever/Wan2.1-T2V-14B-NABLA-0.6-STA-11-3-3 · model · 83 dl · ♡ 1
- 🤗 ai-forever/Wan2.1-T2V-14B-NABLA-0.5-STA-11-5-5 · model · 32 dl
- 🤗 FastVideo/Wan2.1-VSA-T2V-14B-720P-Diffusers · model · 4 dl · ♡ 7
- 🤗 FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers · model · 22k dl · ♡ 63
Taxonomy
Topics: Advanced Optical Imaging Technologies · Advanced Vision and Imaging · Cellular Automata and Applications
Methods: Softmax · Attention Is All You Need · Diffusion
