Optimizing Mixture of Block Attention
Guangxuan Xiao, Junxian Guo, Kasra Mazaheri, Song Han

TL;DR
This paper analyzes and improves the Mixture of Block Attention (MoBA) mechanism for long-context processing in LLMs, introducing a hardware-efficient implementation that maintains performance while significantly reducing computational costs.
Contribution
It provides a statistical analysis of MoBA's performance factors, proposes architectural improvements, and introduces FlashMoBA, a GPU-optimized kernel enabling practical, efficient MoBA deployment.
Findings
Improved MoBA matches dense attention performance in LLMs.
FlashMoBA achieves up to 14.7x speedup over FlashAttention-2.
Small block sizes with clustering enhance routing accuracy.
Abstract
Mixture of Block Attention (MoBA) (Lu et al., 2025) is a promising building block for efficiently processing long contexts in LLMs by enabling queries to sparsely attend to a small subset of key-value blocks, drastically reducing computational cost. However, the design principles governing MoBA's performance are poorly understood, and it lacks an efficient GPU implementation, hindering its practical adoption. In this paper, we first develop a statistical model to analyze MoBA's underlying mechanics. Our model reveals that performance critically depends on the router's ability to accurately distinguish relevant from irrelevant blocks based on query-key affinities. We derive a signal-to-noise ratio that formally connects architectural parameters to this retrieval accuracy. Guided by our analysis, we identify two key pathways for improvement: using smaller block sizes and applying a short…
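The routing step described in the abstract can be illustrated with a minimal NumPy sketch for a single query. It assumes each block is represented by the mean of its keys and scored by a dot product with the query; the abstract only says routing is based on query-key affinities, so this scoring rule is an assumption of the sketch. The function names `route_blocks`, `block_sparse_attention`, and `dense_attention` are hypothetical and are not part of FlashMoBA. The sketch also ignores causal masking, batching, and multiple heads.

```python
import numpy as np

def route_blocks(q, K, block_size, top_k):
    """Return sorted indices of the top_k key blocks by q . mean(block keys)."""
    n, d = K.shape
    assert n % block_size == 0, "sketch assumes length is a multiple of block_size"
    # Assumption: mean-pooled keys act as the block representative.
    centroids = K.reshape(n // block_size, block_size, d).mean(axis=1)
    scores = centroids @ q  # one affinity score per block
    # argsort is ascending, so the last top_k entries are the best blocks.
    return np.sort(np.argsort(scores)[-top_k:])

def block_sparse_attention(q, K, V, block_size, top_k):
    """Softmax attention for one query, restricted to the routed blocks."""
    blocks = route_blocks(q, K, block_size, top_k)
    # Expand block ids into the token indices they cover.
    idx = (blocks[:, None] * block_size + np.arange(block_size)).ravel()
    logits = K[idx] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[idx], blocks

def dense_attention(q, K, V):
    """Reference: softmax attention over every key."""
    logits = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V

# Toy usage: 64 keys in 8 blocks, with block 3 made strongly relevant to q.
rng = np.random.default_rng(0)
d, n, block_size = 16, 64, 8
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
K[24:32] += 3.0 * q  # tokens 24..31 form block 3
out, chosen = block_sparse_attention(q, K, V, block_size, top_k=2)
print("routed blocks:", chosen.tolist())
```

When `top_k` equals the number of blocks, the sketch reduces to dense attention, which is a convenient sanity check. With a smaller `top_k`, the query touches only `top_k * block_size` keys, which is where the computational saving comes from.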
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper motivates a better understanding of which aspects contribute to the improvement of the MoBA architecture. The characterization of the SNR as a function of the head size and block size is, to the best of my knowledge, novel and provides a good basic approximation of the framework for tuning the performance of MoBA attention.
2. Their kernel implementation is particularly helpful for encouraging broader adoption of the architecture. Given that attention tends to be a bottleneck fo
The following are my concerns with the paper:
1. The core contribution of the paper is relegated to the Appendix. This makes the paper a bit hard to follow, and given that the results motivate the majority of the paper, I do think that at least a part of it should be featured in the main paper.
2. There are a number of assumptions made in the statistical analysis that may not hold true and at the very least merit some grounding with experimental results: L805 makes the assumption that q^Tk ar
1. The paper offers a statistical view that links MoBA hyperparameters to the SNR of attention computation. Although the connection between SNR and end-to-end model performance is not formally derived, the analysis provides a useful proxy for selecting better MoBA hyperparameter configurations.
2. The implementation of a FlashAttention-style MoBA kernel makes the approach practical even with small block sizes. The kernel achieves comparable speed to FlashAttention on short sequences and delivers
1. Experimental setup: The model architecture setup introduces confounding factors. While the paper claims to focus on optimizing MoBA performance, the model architecture employs sliding window attention (SWA) in half of the layers and dense attention in others, limiting the proportion of true MoBA layers. This mixture complicates the attribution of performance improvements and makes it unclear how much of the gain comes from MoBA rather than from the SWA or dense attention components.
2. Key convoluti
1. Theoretical Framework: The paper introduces a novel signal-to-noise ratio (SNR) model that provides clear and actionable design principles. This provides guidelines for the selection of head dimension and block size.
2. High-Performance CUDA Kernel: FlashMoBA is a well-engineered, hardware-aware CUDA kernel. The Tiled-Topk is especially useful.
3. Strong Benchmark Results: The optimized MoBA models are shown to match or even outperform dense attention on challenging long-context benchmarks li
1. Limited Generalizability Due to Small Model Scale: All experiments are conducted on a 340M parameter model. This raises significant questions about whether the paper's core findings would scale to much larger models.
2. Unsubstantiated Link Between SNR and Experiments: The key experiment in Table 4, designed to validate the SNR theory's dependency on head dimension d, fails to control for model size (line 289). It is unclear if the performance improvements in Table 4 are due to the claime
Taxonomy
Topics: Network Packet Processing and Optimization · Graph Theory and Algorithms · Caching and Content Delivery
