MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax: Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao

TL;DR
MiniMax-M1 is a large-scale hybrid-attention model with 456 billion parameters. It supports a context length of 1 million tokens, is trained efficiently with a novel RL algorithm, and excels at long-context and complex tasks.
Contribution
The paper introduces MiniMax-M1, the first open-weight, large-scale hybrid-attention reasoning model built on lightning attention, together with a new RL algorithm, CISPO. The combination enables efficient training and strong performance on long-context tasks.
Findings
Supports 1 million token context length
Completes full RL training in three weeks on 512 H800 GPUs, at a rental cost of $534,700
Outperforms comparable models on complex software engineering tasks
Abstract
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition…
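The abstract credits lightning attention with making long generations cheap. The sketch below is not MiniMax's kernel. It assumes only that lightning attention belongs to the linear-attention family, and it uses the generic unnormalized recurrence with hypothetical function names. It shows why such a layer can process each new token at a cost independent of how many tokens came before: the entire history is folded into a fixed-size state.

```python
def linear_attention_recurrent(qs, ks, vs):
    """Causal linear attention using a running (d_k x d_v) state.

    Per-token work is O(d_k * d_v), regardless of sequence length.
    Generic illustration only; not the MiniMax-M1 implementation.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Fold the new token into the state: state += outer(k, v).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out: o = q @ state.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(qs, ks, vs):
    """Same result computed the attention-matrix way: O(t) work at step t."""
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * d_v
        for s in range(t + 1):  # causal: only positions s <= t
            score = sum(a * b for a, b in zip(q, ks[s]))
            for j in range(d_v):
                out[j] += score * vs[s][j]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ks = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
    vs = [[1.0], [2.0], [3.0]]
    print(linear_attention_recurrent(qs, ks, vs))
    print(linear_attention_quadratic(qs, ks, vs))
```

Both functions return the same outputs, but the recurrent form never revisits earlier tokens. Softmax attention has no such fixed-size summary, so its per-token decoding cost grows with context length.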
Code & Models
- 🤗 jingyaogong/minimind-3 (model) · 81 dl · ♡ 1
- 🤗 MiniMaxAI/MiniMax-M1-40k (model) · 21k dl · ♡ 184
- 🤗 MiniMaxAI/MiniMax-M1-80k (model) · 862 dl · ♡ 691
- 🤗 justinjja/MiniMax-M1-80k-W4A16-INT4 (model) · 20 dl · ♡ 2
- 🤗 MiniMaxAI/MiniMax-M1-80k-hf (model) · 54 dl · ♡ 8
- 🤗 MiniMaxAI/MiniMax-M1-40k-hf (model) · 52 dl · ♡ 12
- 🤗 FriendliAI/MiniMax-M1-80k (model) · 21 dl
- 🤗 jingyaogong/minimind-3-pytorch (model)
- 🤗 jingyaogong/minimind-3-moe (model) · 19 dl · ♡ 1
- 🤗 jingyaogong/minimind-3-gguf (model) · 163 dl · ♡ 1
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Computer Graphics and Visualization Techniques
