LaDiMo: Layer-wise Distillation Inspired MoEfier

Sungyoon Kim; Youngjun Kim; Kihyo Moon; Minsung Jang

arXiv:2408.04278·cs.CL·August 9, 2024

Open Access

TL;DR

LaDiMo converts pre-trained dense Transformer models into sparse Mixture-of-Experts (MoE) models with minimal additional training, using knowledge distillation and adaptive routing to reduce the number of activated parameters while maintaining accuracy.

Contribution

The paper introduces LaDiMo, a layer-wise distillation algorithm that transforms non-MoE models into MoE models efficiently, requiring only limited data and training.

Findings

01 Reduces activated parameters by over 20% in converted models.

02 Maintains high accuracy with minimal additional training.

03 Efficiently converts LLaMA2-7B to MoE using only 100K tokens.
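
To make "activated parameters" concrete, the following is a small arithmetic sketch. It assumes a model whose parameters split into a shared part (attention, embeddings) and a feed-forward part that is replaced by experts, with top-k routing. The function names and all numbers are hypothetical and chosen only for illustration; they are not taken from the paper and do not describe LLaMA2-7B.

```python
def activated_params(shared_params, expert_params, top_k, router_params=0):
    """Parameters touched per token: shared weights, router, and the k chosen experts."""
    return shared_params + router_params + top_k * expert_params


def activated_reduction(dense_total, moe_activated):
    """Fractional drop in per-token parameters relative to the dense model."""
    return 1.0 - moe_activated / dense_total


# Hypothetical units: 40 shared + 60 dense feed-forward = 100 in the dense model.
dense_total = 40 + 60

# Hypothetical MoE replacement: experts of 15 units each, 2 of them active per token.
moe_activated = activated_params(shared_params=40, expert_params=15, top_k=2)

reduction = activated_reduction(dense_total, moe_activated)
print(f"activated: {moe_activated} of {dense_total} units")
print(f"reduction: {reduction:.0%}")
```

The total parameter count of an MoE model can stay the same or grow; only the per-token activated count is what this quantity measures.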

Abstract

The advent of large language models has revolutionized natural language processing, but their increasing complexity has led to substantial training costs, resource demands, and environmental impacts. In response, sparse Mixture-of-Experts (MoE) models have emerged as a promising alternative to dense models. Since training MoE models from scratch can be prohibitively expensive, recent studies have explored leveraging knowledge from pre-trained non-MoE models. However, existing approaches have limitations, such as requiring significant hardware resources and data. We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost. LaDiMo consists of two stages: layer-wise expert construction and routing policy decision. By harnessing the concept of Knowledge Distillation, we compress the model and…
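
The layer-wise idea in the abstract can be illustrated with a minimal NumPy sketch: a dense feed-forward block acts as the teacher, a top-k routed set of smaller experts acts as the student, and the distillation objective is the mean squared error between their outputs on the same inputs. Everything here is an assumption made for illustration, including the sizes, the ReLU block, the untrained router `W_router`, and the choice to initialise experts by slicing the teacher's hidden units; the abstract does not specify how LaDiMo builds or trains its experts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, top_k = 8, 32, 4, 2  # hypothetical sizes

# Teacher: one dense feed-forward block, y = relu(x W1) W2.
W1 = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
W2 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)


def dense_ffn(x):
    return np.maximum(x @ W1, 0.0) @ W2


# Student experts (assumed initialisation): each expert owns an equal slice
# of the teacher's hidden units, so all experts together equal the teacher.
slices = np.split(np.arange(d_hidden), n_experts)
experts = [(W1[:, idx].copy(), W2[idx, :].copy()) for idx in slices]

# Router (assumed): a random linear scorer, left untrained in this sketch.
W_router = rng.normal(size=(d_model, n_experts))


def moe_ffn(x, k):
    """Sum the outputs of the k highest-scoring experts for each token."""
    scores = x @ W_router                        # (tokens, n_experts)
    chosen = np.argsort(-scores, axis=1)[:, :k]  # top-k expert ids per token
    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        mask = (chosen == e).any(axis=1)         # tokens routed to expert e
        out[mask] += np.maximum(x[mask] @ w_in, 0.0) @ w_out
    return out


def layer_distill_loss(x, k):
    """Layer-wise distillation objective: MSE between student and teacher outputs."""
    return float(np.mean((moe_ffn(x, k) - dense_ffn(x)) ** 2))


x = rng.normal(size=(16, d_model))               # stand-in for layer inputs
loss_all = layer_distill_loss(x, n_experts)      # every expert active
loss_topk = layer_distill_loss(x, top_k)         # sparse routing
print("loss with all experts:", loss_all)
print("loss with top-k experts:", loss_topk)
```

With every expert active the student reproduces the teacher up to floating-point error, while sparse routing leaves a gap. Training the experts and router on a small amount of data to shrink that gap, one layer at a time, is the role distillation would play in a scheme of this kind.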

Taxonomy

Topics: Advanced Memory and Neural Computing · Analytical Chemistry and Sensors · Advanced Control Systems Optimization

Methods: Mixture of Experts · Knowledge Distillation