AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
Bo-Wen Zhang, Liangdong Wang, Ye Yuan, Jijie Li, Shuhao Gu, Mengdi Zhao, Xinya Wu, Guang Liu, Chengwei Wu, Hanyu Zhao, Li Du, Yiming Ju, Quanyue Ma, Yulong Ao, Yingli Zhao, Songhe Zhu, Zhou Cao, Dong Liang, Yonghua Lin, Ming Zhang, Shunfei Wang, Yanxin Zhou, Min Ye

TL;DR
AquilaMoE introduces an efficient two-stage training methodology for large-scale MoE language models, leveraging knowledge transfer from smaller models to reduce data needs and improve training efficiency.
Contribution
The paper presents a novel EfficientScale training approach for MoE models, combining scale-up and scale-out strategies to enhance performance and reduce resource requirements.
Findings
Successful training of a 16B model using knowledge transfer
AquilaMoE achieves reduced loss during continuous pretraining
Demonstrated improved training efficiency over traditional methods
Abstract
In recent years, with the rapid application of large language models across various fields, the scale of these models has gradually increased, and the resources required for their pre-training have grown exponentially. Training an LLM from scratch consumes substantial computational resources, whereas scaling up from a smaller model is more efficient and has thus attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model that has 8 experts with 16 billion parameters each and is developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and…
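The Scale-Up idea described in the abstract, initializing a larger model from a smaller pre-trained one, can be illustrated with a minimal sketch. The abstract shown here does not say which initialization scheme the authors use, so the code below substitutes a generic function-preserving widening (in the style of Net2Net) applied to a toy two-layer ReLU MLP. The names `widen_mlp`, `mlp`, and `rand_matrix`, and all of the shapes, are hypothetical and chosen only for illustration.

```python
import random

def matmul(a, b):
    # Plain-Python matrix product: a is (n, k), b is (k, m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mlp(x, w1, b1, w2):
    # Two-layer MLP with ReLU: relu(x @ w1 + b1) @ w2.
    hidden = [[max(v + b, 0.0) for v, b in zip(row, b1)] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def widen_mlp(w1, b1, w2, new_hidden, rng):
    # ASSUMPTION: illustrative Net2Net-style widening, not the paper's confirmed method.
    # w1: (d_in, h), b1: (h,), w2: (h, d_out). Grows the hidden size h to new_hidden.
    h = len(b1)
    assert new_hidden >= h
    # Keep every original unit, then fill the extra slots with copies of random units.
    mapping = list(range(h)) + [rng.randrange(h) for _ in range(new_hidden - h)]
    counts = [mapping.count(i) for i in range(h)]
    # Incoming weights and biases are copied as-is for each replicated unit.
    w1_big = [[row[j] for j in mapping] for row in w1]
    b1_big = [b1[j] for j in mapping]
    # Outgoing weights are divided by the replication count so the output is unchanged.
    w2_big = [[v / counts[j] for v in w2[j]] for j in mapping]
    return w1_big, b1_big, w2_big

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# A "small pre-trained" model stand-in with hidden size 8, widened to 16.
w1, b1, w2 = rand_matrix(4, 8), [rng.gauss(0.0, 1.0) for _ in range(8)], rand_matrix(8, 3)
x = rand_matrix(5, 4)
w1_big, b1_big, w2_big = widen_mlp(w1, b1, w2, 16, rng)

small_out = mlp(x, w1, b1, w2)
big_out = mlp(x, w1_big, b1_big, w2_big)
max_diff = max(abs(a - b) for ra, rb in zip(small_out, big_out) for a, b in zip(ra, rb))
```

In this sketch the widened model reproduces the small model's outputs up to floating-point rounding, so continued training would start from the smaller model's loss rather than from a random initialization.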
Taxonomy
Topics: Topic Modeling
Methods: Mixture of Experts
