AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies

Bo-Wen Zhang; Liangdong Wang; Ye Yuan; Jijie Li; Shuhao Gu; Mengdi Zhao; Xinya Wu; Guang Liu; Chengwei Wu; Hanyu Zhao; Li Du; Yiming Ju; Quanyue Ma; Yulong Ao; Yingli Zhao; Songhe Zhu; Zhou Cao; Dong Liang; Yonghua Lin; Ming Zhang; Shunfei Wang; Yanxin Zhou; Min Ye; Xuekai Chen; Xinyang Yu; Xiangjun Huang; Jian Yang

arXiv:2408.06567·cs.CL·August 14, 2024



TL;DR

AquilaMoE introduces an efficient two-stage training methodology for large-scale MoE language models, leveraging knowledge transfer from smaller models to reduce data needs and improve training efficiency.

Contribution

The paper presents a novel EfficientScale training approach for MoE models, combining scale-up and scale-out strategies to enhance performance and reduce resource requirements.

Findings

01 Successful training of a 16B model using knowledge transfer

02 AquilaMoE achieves reduced loss during continuous pretraining

03 Demonstrated improved training efficiency over traditional methods

Abstract

In recent years, as large language models have been rapidly applied across many fields, their scale has steadily increased and the resources required for pre-training have grown exponentially. Training an LLM from scratch is computationally expensive, whereas scaling up from a smaller model is more efficient and has therefore attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model that has 8 experts with 16 billion parameters each and is developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and…
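The abstract describes Scale-Up only as initializing a larger model from a pre-trained smaller one. As a rough illustration of that general idea, the sketch below widens the hidden layer of a tiny two-layer MLP while leaving its outputs unchanged. The zero-padding scheme, the helper names (`mlp`, `scale_up_mlp`), and the toy dimensions are all assumptions chosen for illustration; the text above does not state which initialization AquilaMoE actually uses, so this should not be read as the authors' method.

```python
import random

def relu(v):
    return [max(x, 0.0) for x in v]

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mlp(x, w1, b1, w2):
    # y = w2 @ relu(w1 @ x + b1)
    hidden = relu([h + b for h, b in zip(matvec(w1, x), b1)])
    return matvec(w2, hidden)

def scale_up_mlp(w1, b1, w2, new_hidden, rng):
    """Hypothetical helper: widen the hidden layer to new_hidden units.

    Assumed scheme (not taken from the paper): copy the small model's
    weights, give the new hidden units small random input weights, and
    set their output weights to zero so the function is unchanged.
    """
    old_hidden = len(w1)
    d_in = len(w1[0])
    extra = new_hidden - old_hidden
    if extra < 0:
        raise ValueError("new_hidden must be >= the current hidden width")
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(extra)]
    w1_big = [list(row) for row in w1] + new_rows
    b1_big = list(b1) + [0.0] * extra
    # Zero output weights: new units contribute nothing until trained.
    w2_big = [list(row) + [0.0] * extra for row in w2]
    return w1_big, b1_big, w2_big

rng = random.Random(0)
d_in, hidden, d_out = 3, 4, 2
w1 = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(hidden)]
b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(d_out)]

w1_big, b1_big, w2_big = scale_up_mlp(w1, b1, w2, new_hidden=8, rng=rng)

x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
small_out = mlp(x, w1, b1, w2)
big_out = mlp(x, w1_big, b1_big, w2_big)
```

Under these assumptions the widened model starts from the small model's behavior rather than from random outputs, which is one way a larger model could inherit knowledge before further training.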


Code & Models

Repositories

FlagAI-Open/Aquila-MoE


Taxonomy

Topics: Topic Modeling

Methods: Mixture of Experts