VMoBA: Mixture-of-Block Attention for Video Diffusion Models
Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong

TL;DR
This paper introduces VMoBA, a novel sparse attention mechanism tailored for Video Diffusion Models, significantly reducing computational complexity while maintaining or improving video generation quality.
Contribution
The paper proposes VMoBA, a new sparse attention method with dynamic block selection and layer-wise adaptation, specifically designed for efficient and effective video diffusion modeling.
Findings
Reduces training FLOPs by 2.92x.
Attains a 1.48x latency speedup during training.
Maintains or improves video generation quality.
Abstract
The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically…
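As a rough illustration of the layer-wise 1D-2D-3D block partition named in the abstract, the NumPy sketch below assigns every token of a (T, H, W) latent grid to a key block and cycles the partition mode across layers. Reading 1D as temporal, 2D as spatial, and 3D as spatio-temporal is an assumption, as are the names block_ids and mode_for_layer and the block sizes; this is not the authors' implementation.

```python
import numpy as np

def block_ids(shape, block, mode):
    """Assign each token of a (T, H, W) grid a key-block id.

    Assumed meaning of the modes (not confirmed by the source):
      "1d": split along time only
      "2d": split along height and width only
      "3d": split along all three axes
    """
    T, H, W = shape
    bt, bh, bw = block
    if mode == "1d":
        bh, bw = H, W          # a block covers whole frames
    elif mode == "2d":
        bt = T                 # a block covers all frames
    elif mode != "3d":
        raise ValueError(f"unknown mode: {mode}")
    t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    nh = -(-H // bh)           # ceil division: blocks along height
    nw = -(-W // bw)           # ceil division: blocks along width
    ids = (t // bt) * nh * nw + (h // bh) * nw + (w // bw)
    return ids.reshape(-1)     # flattened in (T, H, W) order

def mode_for_layer(layer_idx):
    """Hypothetical cyclic schedule: layer 0 -> 1d, 1 -> 2d, 2 -> 3d, 3 -> 1d, ..."""
    return ("1d", "2d", "3d")[layer_idx % 3]
```

For example, block_ids((4, 4, 4), (2, 2, 2), mode_for_layer(2)) splits a 64-token grid into eight 2x2x2 blocks, while the same call with layer 0 yields two temporal blocks of two frames each.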
Peer Reviews
Decision · ICLR 2026 Poster
Overall, the paper could be considered an all-round incremental extension of MoBA. 1. VMoBA provides a 2.92 times FLOPs reduction and a 1.44 times speedup compared to the original model, which shows some practical feasibility in terms of training. 2. The ablations are relatively complete, which ensures the model design is informed.
There are several points about the main experiment that need to be further clarified: 1. In the training-based part of the main experiments, no trainable sparse attention pattern is included, which limits the generality of the benchmark. 2. In the training-based part of the main experiments, only one baseline method designed for accelerating training is presented. 3. The reason why methods for inference acceleration are used as baselines for benchmarking training acceleration has not been
- The three proposed modifications (1D-2D-3D partitioning, global selection, threshold-based sparsity) are well-motivated by the observed limitations of applying MoBA naively to video. Each component is clearly linked to a specific empirical observation. - The authors evaluate VMoBA in both training-based and training-free settings across multiple resolutions, using standard metrics (VBench, PSNR) and complete ablation studies. VMoBA achieves FLOPs reduction and training-time speedup with minima
- The study omits some recent linear or hybrid video attentions that could serve as stronger baselines, such as STA[1] and RainFusion[2]. - The paper should include more human evaluation. Human judgment on video quality and video consistency is crucial for assessing the performance. - In the global selection part, this module prioritizes key blocks with the highest overall significance, but may overlook certain keys that are locally relevant to queries yet have low global scores. The high-freque
1. The paper diagnoses three phenomena in DiT attention (1D/2D/3D locality, uneven query importance, and head-wise concentration), then maps them to three design choices (1–2–3D recurrent partitioning, global selection, and thresholded selection). This “observations → mechanisms” linkage is well argued. 2. Unlike training-free sparse attention methods, VMoBA is a trainable block-sparsity scheme intended to replace full attention during training. This positions it to deliver training compute savings
1. The paper reports ~2.40× FLOPs reduction but only ~1.35× end-to-end latency speedup in the training-free 720p setting. A deeper breakdown (kernel MFU, QK/softmax/AttnV time, IO cost) is needed to explain the under-translation from theoretical to realized speed. 2. Several contributions hinge on MoBA being less efficient, yet the paper does not thoroughly analyze why MoBA is inefficient and why the proposed methods can make it more efficient. 3. Strong, recent sparse-attention baselines are m
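The global and thresholded selection that the reviews discuss can be contrasted with per-query top-k selection in a small NumPy sketch. It assumes a precomputed matrix of query-to-key-block affinity scores for one attention head, and it assumes the threshold is applied to cumulative softmax-normalized scores; the function names and the parameters k, budget, and tau are hypothetical and not taken from the paper.

```python
import numpy as np

def per_query_topk(scores, k):
    """Every query row keeps its own k highest-scoring key blocks."""
    mask = np.zeros_like(scores, dtype=bool)
    top = np.argsort(-scores, axis=1)[:, :k]
    np.put_along_axis(mask, top, True, axis=1)
    return mask

def global_topk(scores, budget):
    """Keep the `budget` highest-scoring (query, block) pairs over the whole head."""
    flat = np.argsort(-scores, axis=None)[:budget]
    mask = np.zeros(scores.size, dtype=bool)
    mask[flat] = True
    return mask.reshape(scores.shape)

def global_threshold(scores, tau):
    """Keep the fewest pairs whose normalized score mass reaches tau (0 < tau <= 1).

    Assumption: scores are softmax-normalized over the whole head before
    the cumulative sum is taken.
    """
    p = np.exp(scores - scores.max())
    p /= p.sum()
    order = np.argsort(-p, axis=None)           # pairs from strongest to weakest
    cum = np.cumsum(p.reshape(-1)[order])
    n_keep = min(int(np.searchsorted(cum, tau)) + 1, p.size)
    mask = np.zeros(p.size, dtype=bool)
    mask[order[:n_keep]] = True
    return mask.reshape(scores.shape)
```

With scores = np.array([[5.0, 4.0, 0.0], [1.0, 0.5, 0.2]]), per_query_topk(scores, 1) gives each query one block, whereas global_topk(scores, 2) spends the whole budget on the first query and leaves the second with none. That is the situation one reviewer raises when noting that global selection may overlook keys that are locally relevant but have low global scores.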
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection · Advanced Neural Network Applications
