Fast Video Generation with Sliding Tile Attention

Peiyuan Zhang; Yongqi Chen; Runlong Su; Hangliang Ding; Ion Stoica; Zhengzhong Liu; Hao Zhang

arXiv:2502.04507 · cs.CV · June 6, 2025

TL;DR

This paper introduces sliding tile attention (STA), a hardware-efficient local attention mechanism that replaces 3D full attention in video Diffusion Transformers and cuts end-to-end generation latency from 945 s to 685 s without quality loss.

Contribution

The paper proposes STA, a local attention method that slides its window tile by tile rather than token by token. This hardware-aware design speeds up pretrained video diffusion models without retraining.

Findings

01 STA accelerates attention by up to 17x over FlashAttention-2.

02 STA reduces end-to-end video generation latency from 945 s to 685 s without quality loss.

03 The code is publicly released for reproducibility.
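
The timings quoted on this page can be put in proportion with a few lines of arithmetic. The sketch below assumes that the 800 s attention figure from the abstract and the 945 s total come from the same baseline run, and that the 685 s figure was measured on the same workload; the variable names are illustrative only.

```python
# Timings quoted on this page, in seconds.
baseline_total_s = 945.0   # end-to-end latency with 3D full attention
attention_s = 800.0        # attention time within the baseline (from the abstract)
sta_total_s = 685.0        # end-to-end latency with STA

# Fraction of the baseline run spent in attention.
attention_share = attention_s / baseline_total_s

# Observed end-to-end speedup.
end_to_end_speedup = baseline_total_s / sta_total_s

# Amdahl-style ceiling: the speedup if attention cost nothing at all
# (assumes the non-attention time is unchanged).
non_attention_s = baseline_total_s - attention_s
speedup_ceiling = baseline_total_s / non_attention_s

print(f"attention share of baseline:    {attention_share:.1%}")
print(f"end-to-end speedup with STA:    {end_to_end_speedup:.2f}x")
print(f"ceiling if attention were free: {speedup_ceiling:.2f}x")
```

The gap between the kernel-level speedup and the end-to-end speedup is expected: only the attention portion of the run gets faster, and the remaining time is unaffected.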

Abstract

Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention…
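
The contrast between token-wise and tile-wise sliding windows can be illustrated by counting how the blocks of the attention mask look in each case. The sketch below is not the paper's kernel or the FastVideo API: the grid, tile, and window sizes, the rule that shifts a window to stay inside the video, and all function names are assumptions chosen for illustration. It classifies every (query tile, key tile) block as dense (all pairs attend), empty (none attend), or mixed.

```python
from itertools import product

GRID = (4, 8, 8)          # latent video size in tokens (T, H, W); illustrative
TILE = (2, 2, 2)          # tokens per tile along each axis; illustrative
WINDOW_TILES = (2, 3, 3)  # local window size measured in tiles; illustrative

TILE_GRID = tuple(g // t for g, t in zip(GRID, TILE))
WINDOW_TOKENS = tuple(w * t for w, t in zip(WINDOW_TILES, TILE))


def in_window(q, k, window, size):
    """True if k lies in the window centred on q, shifted to stay in bounds."""
    for qi, ki, wi, ni in zip(q, k, window, size):
        start = min(max(qi - wi // 2, 0), ni - wi)  # assumed boundary rule
        if not start <= ki < start + wi:
            return False
    return True


def swa_attends(q, k):
    """Token-wise sliding window: the window follows each query token."""
    return in_window(q, k, WINDOW_TOKENS, GRID)


def sta_attends(q, k):
    """Tile-wise sliding window: the window follows the query's tile."""
    q_tile = tuple(i // t for i, t in zip(q, TILE))
    k_tile = tuple(i // t for i, t in zip(k, TILE))
    return in_window(q_tile, k_tile, WINDOW_TILES, TILE_GRID)


def block_stats(attends):
    """Classify every (query tile, key tile) block of the attention mask."""
    tiles = list(product(*(range(n) for n in TILE_GRID)))
    offsets = list(product(*(range(t) for t in TILE)))
    stats = {"dense": 0, "empty": 0, "mixed": 0}
    for q_tile, k_tile in product(tiles, tiles):
        hits = 0
        for q_off, k_off in product(offsets, offsets):
            q = tuple(a * t + o for a, t, o in zip(q_tile, TILE, q_off))
            k = tuple(a * t + o for a, t, o in zip(k_tile, TILE, k_off))
            hits += attends(q, k)
        if hits == len(offsets) ** 2:
            stats["dense"] += 1
        elif hits == 0:
            stats["empty"] += 1
        else:
            stats["mixed"] += 1
    return stats


swa = block_stats(swa_attends)
sta = block_stats(sta_attends)
print("token-wise SWA blocks:", swa)
print("tile-wise STA blocks: ", sta)
```

Under these assumptions both schemes give every query the same number of keys, but only the token-wise window produces mixed blocks, which a block-oriented attention kernel must load in full and then partially mask. The tile-wise window leaves only blocks that can be computed whole or skipped entirely.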

Code & Models

Repositories

hao-ai-lab/fastvideo
pytorch

Videos

Fast Video Generation with Sliding Tile Attention · slideslive

Taxonomy

Topics: Advanced Optical Imaging Technologies · Advanced Vision and Imaging · Cellular Automata and Applications

Methods: Softmax · Attention Is All You Need · Diffusion