Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda

TL;DR
This paper introduces DS-MoE, a hybrid dense training and sparse inference approach for Mixture-of-Experts models, achieving high efficiency and performance comparable to dense models while reducing computational costs.
Contribution
Proposes a novel hybrid dense-sparse framework (DS-MoE) that improves parameter and computational efficiency during training and inference of MoE language models.
Findings
DS-MoE models are more parameter-efficient than standard sparse MoEs.
DS-MoE achieves performance comparable to dense models with fewer active parameters.
DS-MoE models run up to 1.86x faster than similar dense models.
Abstract
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4× compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4 times more parameters to achieve comparable performance to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in…
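The dense-training, sparse-inference idea in the abstract can be illustrated with a toy layer. This is a minimal sketch, not the authors' implementation: the class `ToyMoELayer`, the linear "experts", the softmax router, and the top-k selection with gate renormalization at inference are all assumptions made for illustration, since the abstract does not specify the routing rule or the expert architecture.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


class ToyMoELayer:
    """Hypothetical MoE layer; each expert is a single linear map (assumption)."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.num_experts = num_experts
        self.router = rand_matrix(num_experts, dim)
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def forward(self, x, top_k=None):
        """top_k=None: dense pass over all experts (training mode).
        top_k=k: only the k highest-gated experts are evaluated (inference mode)."""
        gates = softmax(matvec(self.router, x))
        if top_k is None:
            active = list(range(self.num_experts))
        else:
            ranked = sorted(range(self.num_experts), key=lambda i: gates[i], reverse=True)
            active = ranked[:top_k]
        # Renormalize gates over the active experts (an assumption of this sketch).
        norm = sum(gates[i] for i in active)
        out = [0.0] * len(x)
        for i in active:
            y = matvec(self.experts[i], x)
            w = gates[i] / norm
            out = [o + w * yi for o, yi in zip(out, y)]
        return out, active


layer = ToyMoELayer(dim=4, num_experts=8)
x = [0.5, -1.0, 0.25, 2.0]
dense_out, dense_active = layer.forward(x)            # training: all 8 experts run
sparse_out, sparse_active = layer.forward(x, top_k=2)  # inference: 2 experts run
```

In this sketch the parameter count is identical in both modes; only the number of experts evaluated per token changes, which is where the inference-time compute saving would come from.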
Code & Models
- 🤗 MonetLLM/monet-vd-1.4B-100BT-hf · model · 12 dl · ♡ 1
- 🤗 MonetLLM/codemonet-vd-1.4B-100BT-hf · model · 4 dl · ♡ 2
- 🤗 MonetLLM/monet-hd-1.4B-100BT-hf · model · 2 dl · ♡ 1
- 🤗 MonetLLM/monet-hd-4.1B-100BT-hf · model · 2 dl · ♡ 1
- 🤗 MonetLLM/monet-hd-850M-100BT-hf · model · 2 dl · ♡ 1
- 🤗 MonetLLM/monet-vd-4.1B-100BT-hf · model · 3 dl · ♡ 2
- 🤗 MonetLLM/monet-vd-850M-100BT-hf · model · 95 dl · ♡ 2
- 🤗 MonetLLM/visionmonet-vd-1.4B-100BT-hf · model · 2 dl · ♡ 1
- 🤗 MonetLLM/monet-vd-1.4B-100BT-chat-hf · model · 4 dl · ♡ 2
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: Mixture of Experts
