MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models
Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath

TL;DR
MoSE introduces slimmable experts in mixture-of-experts models, enabling flexible, continuous accuracy-compute trade-offs during inference, improving efficiency without sacrificing performance.
Contribution
The paper proposes a novel MoE architecture with slimmable experts that supports variable-width execution and more continuous accuracy-compute trade-offs, together with a training recipe and runtime strategies.
Findings
MoSE matches or surpasses standard MoE performance at full width.
MoSE shifts the Pareto frontier towards better accuracy with fewer FLOPs.
The paper presents effective training and inference strategies for slimmable experts in MoE models.
Abstract
Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated, but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for…
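To make the idea of a nested, variable-width expert concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes that "nested" means a narrower expert reuses the first k hidden units of the full expert (a common convention for slimmable networks), that each expert is a two-layer ReLU feed-forward block, and that routing is plain top-k with softmax gates. The names `SlimmableExpert`, `mose_layer`, `width`, and `w_router` are hypothetical.

```python
import numpy as np


class SlimmableExpert:
    """Two-layer feed-forward expert whose hidden layer can be truncated."""

    def __init__(self, d_model, d_hidden, rng):
        self.d_hidden = d_hidden
        self.w_in = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
        self.w_out = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)

    def _active_units(self, width):
        # Number of hidden units used at this width fraction (at least one).
        return max(1, int(round(width * self.d_hidden)))

    def forward(self, x, width=1.0):
        # Assumption: narrower widths reuse a prefix of the full weights,
        # so every sub-expert is nested inside the full expert.
        k = self._active_units(width)
        h = np.maximum(x @ self.w_in[:, :k], 0.0)  # ReLU
        return h @ self.w_out[:k, :]

    def flops(self, n_tokens, width=1.0):
        # Two matmuls, counting 2 FLOPs per multiply-add.
        k = self._active_units(width)
        d_model = self.w_in.shape[0]
        return 2 * 2 * n_tokens * d_model * k


def mose_layer(x, experts, w_router, top_k=2, width=1.0):
    """Route each token to its top_k experts and run them at the given width."""
    logits = x @ w_router                                  # (n_tokens, n_experts)
    top = np.argsort(-logits, axis=1)[:, :top_k]           # chosen expert ids
    top_logits = np.take_along_axis(logits, top, axis=1)
    gates = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)              # softmax over chosen experts

    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(top_k):
            mask = top[:, slot] == e                       # tokens that picked expert e
            if mask.any():
                y = expert.forward(x[mask], width)
                out[mask] += gates[mask, slot][:, None] * y
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    experts = [SlimmableExpert(8, 16, rng) for _ in range(4)]
    w_router = rng.standard_normal((8, 4))
    x = rng.standard_normal((5, 8))
    for width in (0.25, 0.5, 1.0):
        y = mose_layer(x, experts, w_router, top_k=2, width=width)
        print(width, y.shape, experts[0].flops(n_tokens=5, width=width))
```

In this sketch a single `width` is applied to every selected expert, so expert FLOPs scale linearly with it. Choosing the width per token or per expert would be a natural extension, but the sketch does not attempt it.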
Taxonomy
Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning
