Yuan 2.0-M32: Mixture of Experts with Attention Router
Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu Qiao, Houbo He, Zeru Zhang, Zeyu Sun, Junxiong Mao, Chong Shen

TL;DR
Yuan 2.0-M32 introduces an attention-based router for mixture-of-experts models. It activates 2 of 32 experts per token, uses 9.25% of the training computation of a dense model at the same parameter scale, and surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks.
Contribution
The paper proposes the Attention Router for mixture-of-experts models, improving expert selection efficiency and accuracy over classical routers.
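For orientation, here is a minimal sketch of the classical top-k router that the Attention Router is compared against. It is not the paper's Attention Router, whose internals this summary does not describe. The expert count (32) and number of active experts (2) come from the abstract; the toy hidden size, the random weights, the helper names, and the renormalization of the two selected weights are all assumptions made for illustration.

```python
import math
import random

NUM_EXPERTS = 32  # from the abstract
TOP_K = 2         # from the abstract: 2 experts are active per token
HIDDEN = 8        # hypothetical toy hidden size, chosen for readability


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classical_route(token, expert_vectors, top_k=TOP_K):
    """Score each expert by a dot product with the token, keep the top_k.

    Returns [(expert_index, weight), ...] sorted by descending weight.
    Renormalizing the kept weights to sum to 1 is an assumption here.
    """
    scores = [sum(t * w for t, w in zip(token, vec)) for vec in expert_vectors]
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Toy data: one random feature vector per expert and one random token.
rng = random.Random(0)
expert_vectors = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(HIDDEN)]

routing = classical_route(token, expert_vectors)
print(routing)
```

In this baseline each expert is scored independently of the others. The paper's stated contribution is a router that selects experts more effectively than this kind of scoring.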
Findings
Achieves 55.89% accuracy on the MATH benchmark.
Surpasses Llama3-70B on ARC-Challenge with 95.8% accuracy.
Uses only 9.25% of the training computation of a dense model at the same parameter scale.
Abstract
Yuan 2.0-M32, whose base architecture is similar to that of Yuan-2.0 2B, uses a mixture-of-experts architecture with 32 experts, of which 2 are active. A new router network, the Attention Router, is proposed and adopted for more efficient expert selection; it improves accuracy compared with a model using a classical router network. Yuan 2.0-M32 is trained from scratch on 2000B tokens, and its training computation is only 9.25% of that of a dense model at the same parameter scale. Yuan 2.0-M32 demonstrates competitive capability in coding, math, and various domains of expertise, with only 3.7B active parameters out of 40B in total and 7.4 GFLOPs of forward computation per token, both of which are only 1/19 of Llama3-70B. Yuan 2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, with accuracies of 55.89 and 95.8 respectively. The models and source codes of Yuan 2.0-M32 are…
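The ratios quoted in the abstract can be checked with a few lines of arithmetic. The parameter and FLOP counts below are taken from the abstract; the variable names are illustrative. Reading the 9.25% training-compute figure as the active-to-total parameter ratio is an observation that the numbers happen to match, not a derivation stated in the abstract.

```python
# Figures quoted in the abstract.
active_params = 3.7e9        # active parameters per token
total_params = 40e9          # total parameters
llama3_params = 70e9         # Llama3-70B parameter count
forward_flops = 7.4e9        # forward computation per token

# Share of the model that is active for any one token.
active_fraction = active_params / total_params
# Size of Llama3-70B relative to the active parameters.
param_ratio = llama3_params / active_params
# Forward FLOPs spent per active parameter per token.
flops_per_active_param = forward_flops / active_params

print(f"{active_fraction:.4f}")         # 0.0925
print(f"{param_ratio:.1f}")             # 18.9
print(f"{flops_per_active_param:.1f}")  # 2.0
```

The active fraction equals the 9.25% training-compute figure, the parameter ratio of about 18.9 is consistent with the stated 1/19, and 2 FLOPs per active parameter matches the common rule of thumb for a forward pass.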
Taxonomy
Topics: Expert finding and Q&A systems · Indoor and Outdoor Localization Technologies
Methods: Balanced Selection
