LaDiMo: Layer-wise Distillation Inspired MoEfier
Sungyoon Kim, Youngjun Kim, Kihyo Moon, Minsung Jang

TL;DR
LaDiMo converts pre-trained dense Transformer models into sparse MoE models with minimal additional training, using layer-wise knowledge distillation and adaptive routing to reduce the number of activated parameters while maintaining accuracy.
Contribution
The paper introduces LaDiMo, a layer-wise distillation algorithm that transforms non-MoE models into MoE models efficiently, requiring only limited data and training.
Findings
Reduces activated parameters by over 20% in converted models.
Maintains high accuracy with minimal additional training.
Efficiently converts LLaMA2-7B to MoE using only 100K tokens.
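To make the activated-parameter finding concrete, the sketch below counts feed-forward parameters per layer for a dense block and for a top-k MoE replacement. The dense dimensions (hidden size 4096, FFN size 11008, three projection matrices) follow LLaMA2-7B's public configuration. The expert count, expert width, and top-k value are hypothetical choices for illustration and are not taken from the paper.

```python
# Per-layer feed-forward parameter counts: dense block vs. top-k MoE block.
# Dense dimensions follow LLaMA2-7B's public configuration.
HIDDEN = 4096      # model (hidden) dimension
FFN_DIM = 11008    # dense feed-forward inner dimension

# Hypothetical MoE configuration (assumption, NOT from the paper).
NUM_EXPERTS = 8
EXPERT_DIM = FFN_DIM // 4   # each expert is a quarter-width FFN
TOP_K = 3                   # experts activated per token


def ffn_params(hidden: int, inner: int) -> int:
    """Parameters of a gated FFN with gate, up, and down projections (no biases)."""
    return 3 * hidden * inner


def moe_active_params(hidden: int, expert_dim: int, top_k: int, num_experts: int) -> int:
    """Parameters touched per token: top_k experts plus a linear router."""
    router = hidden * num_experts
    return top_k * ffn_params(hidden, expert_dim) + router


dense = ffn_params(HIDDEN, FFN_DIM)
active = moe_active_params(HIDDEN, EXPERT_DIM, TOP_K, NUM_EXPERTS)
reduction = 1 - active / dense

print(f"dense FFN params per layer:  {dense:,}")
print(f"activated MoE params/layer:  {active:,}")
print(f"FFN-level reduction:         {reduction:.1%}")
```

With these hypothetical values, the per-layer count drops from 135,266,304 to 101,482,496 activated parameters, roughly a quarter. This counts the feed-forward block only. A whole-model figure would be smaller, because attention and embedding parameters are unchanged.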
Abstract
The advent of large language models has revolutionized natural language processing, but their increasing complexity has led to substantial training costs, resource demands, and environmental impacts. In response, sparse Mixture-of-Experts (MoE) models have emerged as a promising alternative to dense models. Since training MoE models from scratch can be prohibitively expensive, recent studies have explored leveraging knowledge from pre-trained non-MoE models. However, existing approaches have limitations, such as requiring significant hardware resources and data. We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost. LaDiMo consists of two stages: layer-wise expert construction and routing policy decision. By harnessing the concept of Knowledge Distillation, we compress the model and…
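The idea of layer-wise distillation can be sketched as follows: a dense feed-forward layer acts as the teacher, and a top-k routed MoE layer is scored on how closely its output matches the teacher's on the same inputs. This is a minimal illustration under stated assumptions, not the authors' implementation. The mean-squared-error objective, the ReLU feed-forward block (LLaMA2 uses a gated variant), the softmax router, and every function name here are assumptions made for the example.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu_ffn(W_in, W_out, x):
    """Two-layer feed-forward block: W_out @ relu(W_in @ x)."""
    h = [max(0.0, v) for v in matvec(W_in, x)]
    return matvec(W_out, h)


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def moe_forward(experts, W_router, x, top_k):
    """Route x to the top_k experts; mix outputs by renormalized gate weights."""
    probs = softmax(matvec(W_router, x))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        W_in, W_out = experts[i]
        y = relu_ffn(W_in, W_out, x)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out, chosen


def layer_distill_loss(teacher, experts, W_router, batch, top_k):
    """Assumed objective: MSE between dense teacher output and MoE student output."""
    total, count = 0.0, 0
    for x in batch:
        t = relu_ffn(teacher[0], teacher[1], x)
        s, _ = moe_forward(experts, W_router, x, top_k)
        total += sum((a - b) ** 2 for a, b in zip(t, s))
        count += len(t)
    return total / count


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


D, TEACHER_DIM, EXPERT_DIM, NUM_EXPERTS, TOP_K = 4, 8, 2, 4, 2  # toy sizes
teacher = (rand_matrix(TEACHER_DIM, D), rand_matrix(D, TEACHER_DIM))
experts = [(rand_matrix(EXPERT_DIM, D), rand_matrix(D, EXPERT_DIM)) for _ in range(NUM_EXPERTS)]
W_router = rand_matrix(NUM_EXPERTS, D)
batch = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(16)]

loss = layer_distill_loss(teacher, experts, W_router, batch, TOP_K)
print(f"layer-wise distillation loss (untrained experts): {loss:.4f}")
```

In a real conversion, this loss would be minimized for each layer with a gradient-based optimizer over the expert and router weights. Because each layer is fitted against its own teacher output, layers can be processed independently, which is one reason a layer-wise scheme can get by with little data.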
Taxonomy
Topics: Advanced Memory and Neural Computing · Analytical Chemistry and Sensors · Advanced Control Systems Optimization
Methods: Mixture of Experts · Knowledge Distillation
