MoBA: Mixture of Block Attention for Long-Context LLMs

Enzhe Lu; Zhejun Jiang; Jingyuan Liu; Yulun Du; Tao Jiang; Chao Hong; Shaowei Liu; Weiran He; Enming Yuan; Yuzhi Wang; Zhiqi Huang; Huan Yuan; Suting Xu; Xinran Xu; Guokun Lai; Yanru Chen; Huabin Zheng; Junjie Yan; Jianlin Su; Yuxin Wu; Neo Y. Zhang; Zhilin Yang; Xinyu Zhou; Mingxing Zhang; Jiezhong Qiu

arXiv:2502.13189 · cs.LG · February 20, 2025 · 2 cites



TL;DR

MoBA introduces a flexible attention mechanism for long-context LLMs that balances efficiency and performance by dynamically switching between full and sparse attention, advancing capabilities for complex reasoning tasks.

Contribution

The paper presents MoBA, a novel mixture of block attention architecture that enables autonomous, adaptive attention in LLMs, improving efficiency without sacrificing performance.

Findings

01 MoBA outperforms existing methods on long-context tasks.

02 It enables seamless transition between full and sparse attention.

03 It supports efficient long-context processing in real-world applications.

Abstract

Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention, which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This…
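The idea of applying MoE-style routing to attention can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `block_sparse_attention`, the parameters `block_size` and `top_k`, and the choice to score blocks by the dot product between a query and the mean of each block's keys are all assumptions made here for illustration, since the truncated abstract above does not spell out the gating rule. The sketch also builds a dense mask for readability, so it shows which positions are attended but does not reduce compute the way a real block-sparse kernel would.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, top_k):
    """Causal attention in which each query attends to at most top_k key blocks.

    Hypothetical sketch: q, k, v have shape (seq_len, dim), and seq_len is
    assumed to be a multiple of block_size.
    """
    n, d = q.shape
    assert n % block_size == 0, "sketch assumes seq_len is a multiple of block_size"
    num_blocks = n // block_size

    # Gate (assumed rule): score each query against the mean-pooled keys of each block.
    block_keys = k.reshape(num_blocks, block_size, d).mean(axis=1)  # (num_blocks, d)
    gate = q @ block_keys.T                                         # (n, num_blocks)

    q_block = np.arange(n) // block_size   # index of the block each query lives in
    blocks = np.arange(num_blocks)
    is_future = blocks[None, :] > q_block[:, None]

    gate = np.where(is_future, -np.inf, gate)   # future blocks rank last
    gate[np.arange(n), q_block] = np.inf        # a query's own block ranks first

    # Router: keep the top_k highest-scoring blocks per query, then drop any
    # future blocks that slipped in when fewer than top_k past blocks exist.
    chosen = np.argsort(-gate, axis=1)[:, :top_k]
    block_mask = np.zeros((n, num_blocks), dtype=bool)
    np.put_along_axis(block_mask, chosen, True, axis=1)
    block_mask &= ~is_future

    # Expand the block-level choice to token level and apply the causal mask.
    allowed = np.repeat(block_mask, block_size, axis=1) & np.tril(np.ones((n, n), dtype=bool))

    # Ordinary softmax attention restricted to the allowed positions.
    scores = np.where(allowed, q @ k.T / np.sqrt(d), -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, allowed
```

Under these assumptions, setting `top_k` equal to the number of blocks selects every past block and recovers ordinary causal attention, while a small `top_k` restricts each query to a few blocks; this is one way to read the "transition between full and sparse attention" listed under Findings.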


Code & Models

Repositories

moonshotai/moba
PyTorch · Official

Models

zen-E/MoBA-1B
Hugging Face model · 611 downloads


Taxonomy

Topics: Medical Imaging Techniques and Applications

Methods: Softmax · Attention Is All You Need