Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou

TL;DR
This paper presents Skywork-MoE, a mixture-of-experts large language model with 146 billion parameters and 16 experts. It focuses on two training techniques, gating logit normalization and adaptive auxiliary loss coefficients, and reports strong benchmark performance.
Contribution
The paper introduces new training methodologies for MoE models, including gating logit normalization and adaptive auxiliary loss, validated through extensive experiments.
Findings
Gating logit normalization enhances expert diversification.
Adaptive auxiliary loss coefficients improve training stability.
Whether to upcycle from dense checkpoints or train from scratch depends on the quality of the existing dense checkpoints and the MoE training budget.
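To make the first finding concrete, here is a minimal sketch of what gating logit normalization could look like for a single token. It assumes the technique standardizes the router logits to zero mean and unit standard deviation, multiplies them by a scale factor, and then applies softmax. The function names, the `scale` parameter, and the example logits are illustrative assumptions, not taken from the paper's code.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_gate(logits, scale=1.0, eps=1e-6):
    # Assumed form: standardize the logits, rescale them, then softmax.
    # A larger `scale` gives a sharper distribution over experts.
    n = len(logits)
    mean = sum(logits) / n
    var = sum((z - mean) ** 2 for z in logits) / n
    std = math.sqrt(var)
    normed = [scale * (z - mean) / (std + eps) for z in logits]
    return softmax(normed)

# Hypothetical router logits for 4 experts; they are tiny, so a plain
# softmax is close to uniform and the experts are barely distinguished.
raw = [0.02, 0.01, -0.01, -0.02]
plain = softmax(raw)
sharp = normalized_gate(raw, scale=2.0)
```

In this toy case the plain softmax is nearly uniform, while the normalized gate concentrates probability on the top-ranked expert without changing the ranking. That is one way to read "enhances expert diversification".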
Abstract
In this technical report, we introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts. It is initialized from the pre-existing dense checkpoints of our Skywork-13B model. We explore the comparative effectiveness of upcycling versus training from scratch initializations. Our findings suggest that the choice between these two approaches should consider both the performance of the existing dense checkpoints and the MoE training budget. We highlight two innovative techniques: gating logit normalization, which improves expert diversification, and adaptive auxiliary loss coefficients, allowing for layer-specific adjustment of auxiliary loss coefficients. Our experimental results validate the effectiveness of these methods. Leveraging these techniques and…
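The abstract does not spell out how the auxiliary loss coefficients are adapted per layer. As a hedged sketch only, the code below assumes each MoE layer keeps its own coefficient and moves it, by exponential moving average, toward a target that grows with that layer's token drop rate and is capped at a maximum. The update rule and the names `beta`, `xi`, and `alpha_max` are assumptions made for illustration.

```python
def update_coefficients(coeffs, drop_rates, beta=0.99, xi=0.2, alpha_max=0.01):
    # One adaptation step for per-layer auxiliary loss coefficients.
    # Assumed rule (illustrative): the target is proportional to the
    # layer's token drop rate, capped at alpha_max, and the coefficient
    # follows the target through an exponential moving average.
    updated = []
    for alpha, drop in zip(coeffs, drop_rates):
        target = min(xi * drop, alpha_max)
        updated.append(beta * alpha + (1.0 - beta) * target)
    return updated

# Three hypothetical layers: balanced, mildly unbalanced, badly unbalanced.
coeffs = [0.001, 0.001, 0.001]
drop_rates = [0.0, 0.005, 0.20]
new_coeffs = update_coefficients(coeffs, drop_rates, beta=0.9)
```

Under these assumptions, a layer that drops no tokens has its balancing penalty relaxed, while a layer that drops many tokens has it raised, so the load-balancing pressure is applied only where it is needed.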
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Topic Modeling
Methods: Mixture of Experts
