DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance
Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu

TL;DR
DraftAttention introduces a GPU-efficient, training-free sparse attention method for video diffusion transformers, significantly reducing computation while maintaining high-quality video generation by leveraging low-resolution guidance.
Contribution
It proposes a novel low-resolution draft attention framework that accelerates video diffusion transformers without additional training, improving speed and efficiency.
Findings
Achieves up to 1.75x speedup on GPUs.
Outperforms existing sparse attention methods in quality.
Maintains high-quality video generation with reduced computation.
Abstract
Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes, posing serious challenges to practical application and scalability. To address this, we propose DraftAttention, a training-free framework for accelerating video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over the latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from draft query and key, exposes redundancy both spatially within each feature map and temporally…
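The pipeline the abstract describes (pool each frame's feature map, build a low-resolution attention map from the pooled "draft" query and key, then use it to decide which regions full attention should compute) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the 2x2 pooling window, and the per-row top-k selection rule with a `keep_ratio` parameter are all assumptions made for the example.

```python
import math
import random


def mean_pool_frames(tokens, frames, height, width, ph, pw):
    """Average token vectors over ph x pw patches inside each frame.

    `tokens` is a flat list of vectors in (frame, row, col) order.
    Assumes height % ph == 0 and width % pw == 0.
    """
    dim = len(tokens[0])
    pooled = []
    for f in range(frames):
        for r0 in range(0, height, ph):
            for c0 in range(0, width, pw):
                acc = [0.0] * dim
                for r in range(r0, r0 + ph):
                    for c in range(c0, c0 + pw):
                        tok = tokens[(f * height + r) * width + c]
                        for i in range(dim):
                            acc[i] += tok[i]
                pooled.append([a / (ph * pw) for a in acc])
    return pooled


def draft_attention_map(draft_q, draft_k):
    """Softmax(QK^T / sqrt(d)) computed on the pooled (draft) tokens."""
    scale = 1.0 / math.sqrt(len(draft_q[0]))
    rows = []
    for q in draft_q:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in draft_k]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def keep_top_blocks(attn, keep_ratio):
    """Hypothetical selection rule: keep the highest-scoring key regions per row."""
    keep = max(1, round(keep_ratio * len(attn[0])))
    mask = []
    for row in attn:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        kept = set(order[:keep])
        mask.append([j in kept for j in range(len(row))])
    return mask


random.seed(0)
FRAMES, HEIGHT, WIDTH, DIM = 2, 4, 4, 8
num_tokens = FRAMES * HEIGHT * WIDTH
q = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(num_tokens)]
k = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(num_tokens)]

draft_q = mean_pool_frames(q, FRAMES, HEIGHT, WIDTH, 2, 2)
draft_k = mean_pool_frames(k, FRAMES, HEIGHT, WIDTH, 2, 2)
draft_map = draft_attention_map(draft_q, draft_k)
block_mask = keep_top_blocks(draft_map, keep_ratio=0.25)
print(len(draft_q), sum(block_mask[0]))  # 8 draft tokens, 2 key regions kept per row
```

With 32 full-resolution tokens the draft map is only 8 x 8, so scoring it is cheap relative to the 32 x 32 full map; each `True` entry in `block_mask` marks a query-region/key-region pair that a sparse kernel would actually compute.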
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The presentation is clear and well organized, with thorough motivation, method description, and experimental validation.
- The computational bottlenecks in attention of video diffusion transformers are a pressing issue, and the proposed approach targets a key challenge for scaling video generation systems.
- The core idea, using pooling-based approximations to guide sparse block attention, is conceptually similar to prior work (e.g., MInference for LLMs and SpargeAttention for video diffusion). This overlap weakens the novelty claim.
- The experimental evaluation omits direct comparisons with several relevant sparse or spatially adaptive attention methods such as SpargeAttention, SlidingTileAttention, RadialAttention, and XAttention, which are necessary to situate the method’s performance within t
1) The paper is well written; it is easy to read and understand.
2) The method is training-free and plug-and-play; it works with existing state-of-the-art video diffusion transformers.
3) Good empirical results: the approach achieves up to 2× speedup on A100/H100 GPUs while preserving video quality.
4) The method is orthogonal to other acceleration techniques; it can be combined with quantization or distillation for further efficiency gains.
1) The experiments mainly compare against Sparse VideoGen (SVG). Other sparse attention methods like AdaSpa are discussed but not included, which weakens claims of state-of-the-art performance.
2) The paper relies entirely on automated metrics (PSNR, SSIM, LPIPS, VBench) to assess perceptual quality. These metrics often fail to capture nuanced aspects of realism and user preference. A human study or preference test would significantly strengthen the quality claims.
3) There are no comparisons t
The paper presents DraftAttention, a training-free, plug-and-play sparse attention framework that can accelerate video diffusion transformers while maintaining generation quality. The method is simple, generalizable, and practical for integration into existing models without retraining.
**Limited novelty** The paper mainly proposes two techniques, both of limited novelty: **(a)** Mean pooling compression to obtain a low-resolution attention map for guiding sparse attention has appeared in SpargeAttn, SeerAttention, etc. **(b)** Token reordering to exploit token sparsity and improve hardware efficiency has also been studied in SpargeAttn and SVG2. Moreover, the proposed token permutation is simple (it can be implemented with one line of code in einops) and
1. Simple and effective idea. Compatible with FP8.
2. Hardware-aware: the permutation bridges region-level masks to fixed-size block kernels.
3. Some theory with Frobenius-norm bounds.
4. Consistent quality at meaningful speedups; overhead appears small.
1. Missing highly relevant prior work.
a. Low-res (mean-pooled) attention as a coarse proxy. MoBA and Quest (Query-aware sparsity) are, to my knowledge, the earliest to explicitly use mean-pooling as coarse scoring in LLMs (what this paper's authors refer to as “low-resolution attention”); SpargeAttention and VSA later adopt analogous ideas in image/video.
b. Locality-preserving permutations. Sliding Tile Attention first proposes locality-preserving permutations for efficient block-sparse att
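The token reordering that several reviews mention can be illustrated with a small index computation. The sketch below is one plausible reading of "the permutation bridges region-level masks to fixed-size block kernels", not the paper's code: it assumes tokens arrive in (frame, row, col) raster order and moves every `ph` x `pw` spatial region into a contiguous run, so that one region maps onto one contiguous block of the sequence. The function names and the region size are hypothetical.

```python
def region_major_permutation(frames, height, width, ph, pw):
    """Return perm such that reordered[i] = tokens[perm[i]].

    After reordering, the ph * pw tokens of each spatial region are contiguous.
    Assumes height % ph == 0 and width % pw == 0.
    """
    perm = []
    for f in range(frames):
        for r0 in range(0, height, ph):
            for c0 in range(0, width, pw):
                for r in range(r0, r0 + ph):
                    for c in range(c0, c0 + pw):
                        perm.append((f * height + r) * width + c)
    return perm


def invert_permutation(perm):
    """Inverse mapping, used to restore raster order after attention."""
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return inverse


perm = region_major_permutation(frames=1, height=4, width=4, ph=2, pw=2)
inverse = invert_permutation(perm)

tokens = [f"t{i}" for i in range(16)]
reordered = [tokens[p] for p in perm]
restored = [reordered[inverse[i]] for i in range(len(tokens))]

print(perm)  # [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

In this toy case each group of four consecutive reordered tokens is one 2 x 2 region, so a region-level mask entry corresponds to one fixed-size block; applying `inverse` after attention returns the outputs to their original order.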
Taxonomy
Topics: Advanced Image Processing Techniques · Image and Signal Denoising Methods · Advanced Vision and Imaging
Methods: Softmax · Attention Is All You Need · Diffusion
