Mixture of Experts for Low-Resource LLMs

Ori Bar Joseph; Smadar Arvatz; Noam Kayzer; Dan Revital; Sarel Weinberger

arXiv:2605.17598 · cs.CL · May 19, 2026

TL;DR

This paper investigates how mixture-of-experts (MoE) models route tokens in low-resource languages, using Hebrew as the main testbed. It finds that expert usage collapses in the deepest layers, and that continual pre-training on balanced bilingual data mitigates the collapse and improves multilingual performance.

Contribution

The paper provides the first detailed analysis of routing dynamics in MoE models for low-resource languages, and shows that continual pre-training improves multilingual routing and performance more completely than supervised fine-tuning alone.

Findings

01

Routing collapse occurs for low-resource languages: in the final layers, tokens concentrate on a narrow subset of experts.

02

Continual pre-training on balanced bilingual data increases routing entropy and expert sharing.

03

Routing improvements lead to better downstream benchmark results.

Abstract

Mixture-of-Experts (MoE) architectures enable efficient model scaling, yet expert routing behavior across underrepresented languages remains poorly understood. We analyze routing dynamics in two architecturally distinct MoE models, a pure Transformer (Qwen3-30B-A3B) and a hybrid Mamba-Transformer (Nemotron-3-Nano-30B-A3B), using Hebrew as a morphologically rich, low-resource testbed. Both pre-trained models exhibit deep-layer routing collapse: usage entropy drops sharply in final layers and tokens concentrate on a narrow expert subset, a pattern largely absent for English. Continual pre-training (CPT) on balanced bilingual data substantially corrects this imbalance, increasing entropy and shifting routing toward shared, language-agnostic experts; supervised fine-tuning (SFT) alone achieves less complete correction. Extending the analysis to Japanese reveals quantitatively…
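
The usage entropy mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the exact definition the authors use, so this assumes the Shannon entropy of one layer's expert-usage distribution, normalized by the log of the number of experts. The function name `usage_entropy` and the toy routing data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def usage_entropy(expert_ids, num_experts):
    """Normalized Shannon entropy of expert usage in one MoE layer.

    expert_ids: flat iterable of expert indices chosen by the router
                (one entry per token per selected expert slot).
    num_experts: total number of experts available in the layer.

    Returns a value in [0, 1]. A value near 1 means usage is spread
    evenly across experts; a value near 0 means tokens concentrate on
    very few experts (the "collapse" pattern).
    NOTE: this definition is an assumption, not the paper's stated metric.
    """
    counts = Counter(expert_ids)
    total = sum(counts.values())
    if total == 0 or num_experts < 2:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    # Divide by the maximum possible entropy so layers with different
    # expert counts are comparable.
    return entropy / math.log(num_experts)


# Toy routing decisions for a layer with 8 experts (made-up data).
balanced = [0, 1, 2, 3, 4, 5, 6, 7] * 4   # every expert used equally
collapsed = [0] * 28 + [1] * 4            # almost all tokens on expert 0

print("balanced :", usage_entropy(balanced, 8))
print("collapsed:", usage_entropy(collapsed, 8))
```

Computing this value per layer, separately for text in each language, would give the kind of layer-by-layer comparison the abstract describes: a sharp drop in the final layers for one language but not the other would indicate deep-layer collapse under this assumed metric.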
