Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Bowen Pan; Yikang Shen; Haokun Liu; Mayank Mishra; Gaoyuan Zhang; Aude Oliva; Colin Raffel; Rameswar Panda

arXiv:2404.05567 · cs.LG · April 9, 2024 · 1 citation

TL;DR

This paper introduces DS-MoE, a framework that trains Mixture-of-Experts models with dense computation across all experts and runs them with sparse computation at inference, reaching performance comparable to dense models while activating fewer parameters.

Contribution

Proposes a novel hybrid dense-sparse framework (DS-MoE) that improves parameter and computational efficiency during training and inference of MoE language models.

Findings

01. DS-MoE models are more parameter-efficient than standard sparse MoEs.

02. DS-MoE achieves performance comparable to dense models with fewer active parameters.

03. DS-MoE models run up to 1.86x faster than similar dense models.

Abstract

Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4$\times$ more parameters to achieve performance comparable to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE), which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in…
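The contrast between dense training and sparse inference described in the abstract can be illustrated with a toy MoE layer. This is a minimal sketch and not the authors' implementation: the function `moe_layer`, its `top_k` argument, the use of a single linear map per expert, and the choice not to renormalise the surviving gate weights are all assumptions made for illustration. The excerpt does not specify the paper's actual router, expert architecture, or training loss.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_w, expert_ws, top_k=None):
    """Toy MoE layer (hypothetical, for illustration only).

    x:         (tokens, d) input activations
    router_w:  (d, n_experts) router weights
    expert_ws: (n_experts, d, d) one linear map per expert (an assumption)
    top_k:     None -> dense mode, every expert processes every token;
               int  -> sparse mode, only the top_k experts per token run.
    """
    n_experts = router_w.shape[1]
    gates = softmax(x @ router_w)  # (tokens, n_experts)

    if top_k is not None and top_k < n_experts:
        # Zero the gates of all but the top_k highest-scoring experts per token.
        # Assumption: surviving gates are left as-is, not renormalised.
        dropped = np.argsort(gates, axis=-1)[:, : n_experts - top_k]
        np.put_along_axis(gates, dropped, 0.0, axis=-1)

    out = np.zeros_like(x)
    for e in range(n_experts):
        active = gates[:, e] > 0  # tokens routed to expert e
        if active.any():
            # Only the active tokens are pushed through this expert.
            out[active] += gates[active, e : e + 1] * (x[active] @ expert_ws[e])
    return out, gates

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
router_w = rng.normal(size=(4, 6))
expert_ws = rng.normal(size=(6, 4, 4))

# Dense mode (as in training): all 6 experts contribute to every token.
dense_out, dense_gates = moe_layer(x, router_w, expert_ws)

# Sparse mode (as in inference): only 2 experts run per token.
sparse_out, sparse_gates = moe_layer(x, router_w, expert_ws, top_k=2)
```

In this sketch the parameter count is identical in both modes; what changes is how many expert matrix multiplications are executed per token, which is where the inference-time savings come from.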

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: Mixture of Experts