MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

Nurbek Tastan; Stefanos Laskaridis; Karthik Nandakumar; Samuel Horvath

arXiv:2602.06154 · cs.LG · February 9, 2026

TL;DR

MoSE introduces slimmable experts in mixture-of-experts models, so a single pretrained model supports a more continuous range of accuracy-compute trade-offs at inference time while matching or surpassing standard MoE performance at full width.

Contribution

The paper proposes an MoE architecture whose experts are slimmable, meaning each expert can be executed at variable widths for a more continuous accuracy-compute trade-off. It also provides a training recipe and inference-time strategies.

Findings

1. MoSE matches or surpasses standard MoE performance at full width.
2. MoSE shifts the Pareto frontier towards better accuracy with fewer FLOPs.
3. The paper provides effective training and inference strategies for slimmable experts in MoE models.

Abstract

Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated, but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for…
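
The idea of executing an expert at variable width can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each expert is a two-layer ReLU feed-forward block, that "nested" means a narrower width uses a prefix of the hidden units, and that the router applies a softmax over the top-k logits. All names (`make_expert`, `active_units`, `expert_forward`, `moe_forward`, `expert_flops`) are hypothetical and chosen for illustration only.

```python
import math
import random

def make_expert(d_model, d_hidden, seed):
    # Hypothetical expert: two weight matrices with small random values.
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    return w_in, w_out

def active_units(d_hidden, width):
    # Assumption: a width in (0, 1] keeps the first ceil(width * d_hidden) units.
    return max(1, math.ceil(width * d_hidden))

def expert_forward(x, expert, width=1.0):
    # Nested execution: only a prefix of the hidden units is computed.
    w_in, w_out = expert
    h = active_units(len(w_in), width)
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in[:h]]
    return [sum(w * a for w, a in zip(row[:h], hidden)) for row in w_out]

def moe_forward(x, experts, router, top_k=2, width=1.0):
    # Choose which experts run (top-k routing), then how much of each runs (width).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in chosen)
    gates = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(gates)
    out = [0.0] * len(x)
    for e, g in zip(chosen, gates):
        y = expert_forward(x, experts[e], width)
        out = [o + (g / total) * yi for o, yi in zip(out, y)]
    return out

def expert_flops(d_model, d_hidden, width, top_k):
    # Two matrix-vector products per expert, 2 FLOPs per multiply-add.
    h = active_units(d_hidden, width)
    return top_k * 2 * (2 * d_model * h)

# Toy configuration (illustrative sizes, not taken from the paper).
d_model, d_hidden, n_experts = 4, 8, 4
experts = [make_expert(d_model, d_hidden, seed=i) for i in range(n_experts)]
rng = random.Random(123)
router = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(n_experts)]
x = [0.5, -1.0, 0.25, 2.0]

y_full = moe_forward(x, experts, router, top_k=2, width=1.0)
y_half = moe_forward(x, experts, router, top_k=2, width=0.5)
flops_full = expert_flops(d_model, d_hidden, 1.0, top_k=2)
flops_half = expert_flops(d_model, d_hidden, 0.5, top_k=2)
```

Under these assumptions, expert compute scales linearly with the chosen width, so intermediate widths fill in the gaps between the coarse steps obtained by changing only the number of active experts. How widths are selected per token or per expert at inference is left open here.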

Taxonomy

Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning