Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training
Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohiremen Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas

TL;DR
This paper introduces DTop-p MoE, a dynamic routing method for sparse Mixture-of-Experts models that controls sparsity through a PI controller, improving efficiency and adaptability in large model pre-training.
Contribution
It proposes a novel dynamic Top-p routing mechanism with a PI controller for adaptive sparsity control and layer-wise routing normalization, advancing MoE scalability and efficiency.
Findings
DTop-p outperforms Top-k and fixed-threshold Top-p baselines.
It maintains precise expert activation control across tokens and layers.
It scales well with model size and dataset complexity.
Abstract
Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) Controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization…
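The control loop described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: the positional PI update rule, the gains `kp` and `ki`, the clipping bounds, the choice to measure sparsity as the mean number of active experts per token, and all function and class names are assumptions made for illustration. It shows Top-p expert selection for a batch of tokens and a PI controller that nudges the threshold `p` so the observed activation count tracks a target, without backpropagating through the threshold.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the expert axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def top_p_mask(probs, p):
    """Per token, keep the smallest set of top experts whose cumulative
    routing probability reaches p (the top expert is always kept for p > 0)."""
    order = np.argsort(-probs, axis=-1)
    sorted_probs = np.take_along_axis(probs, order, axis=-1)
    # Probability mass accumulated before each expert in sorted order.
    mass_before = np.cumsum(sorted_probs, axis=-1) - sorted_probs
    keep_sorted = mass_before < p
    mask = np.zeros(probs.shape, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask


class PIThresholdController:
    """Hypothetical PI controller on the Top-p threshold (assumed form)."""

    def __init__(self, target, p_init=0.5, kp=0.01, ki=0.01,
                 p_min=0.01, p_max=0.99):
        self.target = target      # desired mean active experts per token
        self.p_init = p_init
        self.kp, self.ki = kp, ki
        self.p_min, self.p_max = p_min, p_max
        self.integral = 0.0
        self.p = p_init

    def update(self, observed):
        # Positive error: too few experts are active, so raise the threshold.
        error = self.target - observed
        self.integral += error
        p = self.p_init + self.kp * error + self.ki * self.integral
        self.p = float(np.clip(p, self.p_min, self.p_max))
        return self.p


def simulate(steps=300, tokens=512, experts=8, target=2.0, seed=0):
    # Random router logits stand in for a real model's gating network.
    rng = np.random.default_rng(seed)
    ctrl = PIThresholdController(target)
    history = []
    for _ in range(steps):
        probs = softmax(rng.normal(size=(tokens, experts)))
        active = top_p_mask(probs, ctrl.p).sum(axis=-1).mean()
        history.append(float(active))
        ctrl.update(active)
    return history
```

Calling `simulate()` returns the per-step mean number of active experts. In this toy setting the integral term removes the steady-state offset, so the trace settles near `target` even though individual tokens still activate different numbers of experts.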
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Advanced Graph Neural Networks
