LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, Yu Cheng

TL;DR
This paper presents a method for converting existing dense LLaMA models into Mixture-of-Experts (MoE) models through expert partitioning and continual pre-training; the resulting MoE models outperform similar-sized dense models.
Contribution
It introduces an approach for building MoE models from pre-trained dense LLaMA models via expert construction followed by continual pre-training, which avoids training an MoE model from scratch.
Findings
LLaMA-MoE models maintain language abilities after conversion.
LLaMA-MoE-3.5B models outperform similar-sized dense models.
Effective expert partitioning and pre-training strategies improve model performance.
Abstract
Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data-hungry and instability problems. Motivated by this limit, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models could maintain language abilities and route the…
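The expert construction step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes one particular partitioning (a random, non-overlapping split of the FFN's intermediate neurons), a plain softmax top-k gate, and toy dimensions. The names `partition_neurons`, `ffn`, and `top_k_route` are hypothetical. The sketch shows that when a SwiGLU FFN (the FFN form used in LLaMA) is split this way, the outputs of all experts sum back to the dense FFN output, whereas a top-k gate activates only a subset of them.

```python
import math
import random


def silu(v):
    """SiLU activation used inside a SwiGLU feed-forward block."""
    return v / (1.0 + math.exp(-v))


def ffn(x, w_gate, w_up, w_down, neurons):
    """SwiGLU FFN restricted to the given intermediate neuron indices.

    All three weight matrices are stored with one row per intermediate
    neuron (an assumption made for readability in this sketch).
    """
    d_model = len(x)
    out = [0.0] * d_model
    for j in neurons:
        g = sum(w_gate[j][i] * x[i] for i in range(d_model))
        u = sum(w_up[j][i] * x[i] for i in range(d_model))
        h = silu(g) * u
        for i in range(d_model):
            out[i] += h * w_down[j][i]
    return out


def partition_neurons(d_ff, num_experts, seed=0):
    """Assumed strategy: random, equal-sized, non-overlapping split."""
    idx = list(range(d_ff))
    random.Random(seed).shuffle(idx)
    size = d_ff // num_experts
    return [idx[e * size:(e + 1) * size] for e in range(num_experts)]


def top_k_route(x, w_router, k):
    """Hypothetical gate: pick the k highest-scoring experts and
    normalize their scores with a softmax over the selected ones."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_router]
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    z = sum(exps)
    return [(e, p / z) for e, p in zip(top, exps)]


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
d_model, d_ff, num_experts, k = 4, 8, 4, 2


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


w_gate = rand_matrix(d_ff, d_model)
w_up = rand_matrix(d_ff, d_model)
w_down = rand_matrix(d_ff, d_model)
w_router = rand_matrix(num_experts, d_model)  # newly added gate network
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

experts = partition_neurons(d_ff, num_experts)
dense_out = ffn(x, w_gate, w_up, w_down, range(d_ff))
expert_outs = [ffn(x, w_gate, w_up, w_down, e) for e in experts]

# Summing every expert reproduces the dense FFN output.
summed = [sum(o[i] for o in expert_outs) for i in range(d_model)]

# With top-k routing only k experts contribute, so the output differs
# from the dense one in general.
routes = top_k_route(x, w_router, k)
moe_out = [sum(w * expert_outs[e][i] for e, w in routes) for i in range(d_model)]
```

Under these assumptions, `summed` matches `dense_out` up to floating-point error, while `moe_out` generally does not. That gap is one plausible reading of why the abstract pairs expert construction with a continual pre-training stage for the converted model and its new gate networks.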
Code & Models
- 🤗 llama-moe/LLaMA-MoE-v1-3_0B-2_16 · model · 195 dl · ♡ 11
- 🤗 llama-moe/LLaMA-MoE-v1-3_5B-4_16 · model · 281 dl · ♡ 16
- 🤗 llama-moe/LLaMA-MoE-v1-3_5B-2_8 · model · 796 dl · ♡ 15
- 🤗 llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft · model · 18 dl · ♡ 3
- 🤗 llama-moe/LLaMA-MoE-v1-3_0B-2_16-sft · model · 8 dl · ♡ 2
- 🤗 llama-moe/LLaMA-MoE-v1-3_5B-4_16-sft · model · 13 dl · ♡ 1
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Mobile Crowdsensing and Crowdsourcing · Indoor and Outdoor Localization Technologies
Methods: Mixture of Experts
