BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
Youping Gu, Xiaolong Li, Yuhao Hu, Minqi Chen, Bohan Zhuang

TL;DR
BLADE introduces a joint training framework combining adaptive block-sparse attention and sparsity-aware step distillation, significantly accelerating diffusion-based video generation while improving quality without additional data.
Contribution
It proposes a novel data-free joint training method that integrates adaptive sparse attention with step distillation for efficient video diffusion models.
Findings
Achieves a 14.10x inference speedup on Wan2.1-1.3B.
Delivers an 8.89x speedup on short-video models such as CogVideoX-5B.
Improves quality scores on the VBench-2.0 benchmark.
Abstract
Diffusion Transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges -- training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a…
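To make the idea of a content-aware block-sparsity mask concrete, here is a minimal pure-Python sketch. It is not the paper's ASA mechanism or its kernel: the mean pooling of blocks, the dot-product block score, the fixed top-k selection, and all function names (`mean_pool_blocks`, `block_mask`, `block_sparse_attention`, `sparsity`) are assumptions chosen only for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool_blocks(x, block_size):
    """Average each run of `block_size` consecutive token vectors."""
    pooled = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(row[d] for row in block) / len(block) for d in range(dim)])
    return pooled


def block_mask(q, k, block_size, keep_blocks):
    """For every query block, keep the `keep_blocks` highest-scoring key blocks.

    Scoring pooled blocks by dot product is an illustrative assumption,
    not the selection rule used in the paper.
    """
    q_blocks = mean_pool_blocks(q, block_size)
    k_blocks = mean_pool_blocks(k, block_size)
    mask = []
    for qb in q_blocks:
        scores = [dot(qb, kb) for kb in k_blocks]
        ranked = sorted(range(len(k_blocks)), key=lambda j: scores[j], reverse=True)
        kept = set(ranked[:keep_blocks])
        mask.append([j in kept for j in range(len(k_blocks))])
    return mask


def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Softmax attention restricted to the kept key blocks (keep_blocks >= 1)."""
    mask = block_mask(q, k, block_size, keep_blocks)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_vec in enumerate(q):
        row = mask[i // block_size]
        idx = [j for j in range(len(k)) if row[j // block_size]]
        logits = [scale * dot(q_vec, k[j]) for j in idx]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for w, j in zip(weights, idx)) / total
            for d in range(len(v[0]))
        ])
    return out, mask


def sparsity(mask):
    """Fraction of (query block, key block) pairs that are skipped."""
    cells = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / cells


random.seed(0)
tokens, dim, block = 40, 8, 4  # 10 blocks of 4 tokens each
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
out, mask = block_sparse_attention(q, k, v, block_size=block, keep_blocks=2)
print(f"block sparsity: {sparsity(mask):.2f}")
```

Keeping 2 of 10 key blocks per query block skips 80% of the block pairs. A real implementation would skip the masked blocks inside a fused GPU kernel rather than loop in Python; this sketch only shows how a mask derived from the content of Q and K restricts which keys each query attends to.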
Peer Reviews
Decision: ICLR 2026 Poster
The BLADE framework addresses the computational bottleneck of inference in video diffusion models by jointly training the sparse attention mechanism (ASA) with Trajectory Distribution Matching (TDM) distillation. The approach accelerates generation while maintaining output quality, especially at high sparsity, and achieves high-quality video generation in fewer steps, outperforming traditional methods. The paper is clearly motivated, well-wri
1. Although ASA's performance is compared with traditional sparse attention methods (e.g., STA, RaA, SVG), the paper does not delve into the impact of different sparsity patterns (e.g., varying threshold settings, block sizes) on generation quality. Ablation experiments with different sparse configurations could provide further insights.
2. BLADE optimizes generation performance by jointly training sparse attention and trajectory distribution matching. The core innovation here is the fusion of s
- Integration of adaptive block-sparse attention with step distillation, enabling data-free joint training for efficient video generation.
- ASA mechanism dynamically generates content-aware sparsity masks that enable high sparsity levels, achieving hardware-friendly acceleration without quality loss when combined with distillation training.
- Demonstrates substantial speedups (up to 14.10×) on diverse models like CogVideoX-5B and Wan2.1-1.3B, with consistent quality improvements on VBench.
This paper lacks details on experimental settings and comparative results, for example:
- Lack of reporting on specific GPU hours, training batch size, and memory usage for the 100-200 distillation iterations.
- Lack of inference results demonstrating video quality across low-to-high sparsity levels to illustrate the impact.
- Effective under both training-free and distillation-based settings.
- Large speedups with stable quality (VBench and human evaluation confirmed).
- Robust at high sparsity (~80%), outperforming similar methods.
- Detailed pseudocode and source code are provided, making the method easy to follow and reproduce.
- Lacks large-scale and long-sequence experiments.
- ASA is currently implemented in a custom Triton kernel and Block Sparse Attention library, and a more detailed analysis of the runtime contribution of each component would be helpful.
Taxonomy
Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Advanced Image Processing Techniques
