MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao

TL;DR
MoGA introduces a learnable, semantic-aware sparse attention mechanism that significantly improves long video generation efficiency and quality by enabling precise token interactions without blockwise constraints.
Contribution
The paper proposes MoGA, a novel sparse attention method with a learnable token router, enhancing long-range interactions in long video generation models.
Findings
MoGA achieves efficient long-range attention with high accuracy.
The model generates minute-long, multi-shot videos at 24 fps with 580k context length.
Experiments confirm MoGA's effectiveness across various video tasks.
Abstract
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p…
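The routing idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single linear router, the hard argmax assignment, the single head, and every name here (`grouped_attention`, `w_router`, the group count `m`) are assumptions made for illustration. How the router receives gradients during training is not covered. The sketch only shows that once each token is assigned to one group, attention can be computed as an ordinary dense attention call per group, which is why such a scheme needs no custom sparse kernel.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention(x, w_q, w_k, w_v, w_router):
    """Single-head attention restricted to tokens that share a router group.

    x: (n, d) tokens; w_q, w_k, w_v: (d, d); w_router: (d, m).
    The linear router and argmax assignment are illustrative assumptions.
    Returns the output, the group id per token, and the number of
    query-key pairs actually scored.
    """
    n, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Hypothetical router: score every token against m groups, keep the best.
    group = np.argmax(x @ w_router, axis=-1)
    out = np.zeros_like(v)
    pairs = 0
    for g in np.unique(group):
        idx = np.nonzero(group == g)[0]
        # Plain dense attention inside the group; a fused kernel could be
        # substituted here because the sub-problem is an ordinary attention call.
        scores = q[idx] @ k[idx].T / np.sqrt(d)
        out[idx] = softmax(scores) @ v[idx]
        pairs += idx.size ** 2
    return out, group, pairs

rng = np.random.default_rng(0)
n, d, m = 64, 16, 4
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
w_router = rng.standard_normal((d, m))
out, group, pairs = grouped_attention(x, w_q, w_k, w_v, w_router)
print(f"scored {pairs} of {n * n} query-key pairs")
```

With perfectly balanced groups, each of the m groups holds n/m tokens, so the pair count is m · (n/m)² = n²/m. A learned router will generally produce unbalanced groups, in which case the count is higher than n²/m.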
Peer Reviews
Decision: ICLR 2026 Poster
- The introduction of token-level routing as a “mixture-of-groups” attention mechanism is novel and intuitive. It replaces coarse block sparsity with a more fine-grained, data-driven grouping that potentially generalizes better across tasks.
- The paper convincingly shows MoGA’s ability to reduce attention complexity from O(N^2) to approximately O(N^2/M) while preserving quality, with a claimed 1.7× training/inference speedup.
- The design is compatible with FlashAttention, sequence parallelism, a
### Major
- My major concern is novelty relative to previous work. MoGA's token router is inspired by MoE and MoBA, so the distinction between MoGA and these works should be analyzed in depth. The paper should clarify how MoGA fundamentally differs from existing routing-based methods, beyond the token-level analysis.
- The benchmarks are built on top of models like Wan2.1/MMDiT and tested on internal datasets, where the authors integrate their MoGA module into the original a
see Summary
The authors should provide between 20 and 50 video samples to better demonstrate the capabilities and limitations of their method. Additionally, for each prompt, it would be beneficial to include comparisons with other state-of-the-art methods. Ideally, there should be 3 to 5 comparison methods with corresponding videos for each prompt. The authors should provide detailed results of the user study, including statistical analysis and user feedback. This will help in understanding how the proposed
The paper's primary strength lies in its elegant and highly effective solution to the long-context problem. By replacing coarse block-level scoring with a precise, end-to-end token router, MoGA represents a conceptual advance over prior sparse attention methods. The approach is remarkably practical, as it is kernel-free and seamlessly integrates with existing high-performance technologies like FlashAttention and sequence parallelism. The experimental results are state-of-the-art and convincingly
One potential point of discussion is the method's reliance on a powerful, pre-existing base model (Wan2.1) for fine-tuning, which makes it slightly difficult to isolate the gains of MoGA from the inherent capabilities of the foundation model. Additionally, the paper introduces an impressive and complex data pipeline for creating multi-shot training samples; the importance of this high-quality, specialized data to the final result is significant and could be considered a major contribution in its
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Video Coding and Compression Technologies
