Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training

Can Jin; Hongwu Peng; Mingcan Xiang; Qixin Zhang; Xiangchi Yuan; Amit Hasan; Ohiremen Dibua; Yifan Gong; Yan Kang; Dimitris N. Metaxas

arXiv:2512.13996·cs.AI·December 17, 2025

TL;DR

This paper introduces DTop-p MoE, a dynamic routing method for sparse Mixture-of-Experts models that controls sparsity through a PI controller, improving efficiency and adaptability in large model pre-training.

Contribution

It proposes a novel dynamic Top-p routing mechanism with a PI controller for adaptive sparsity control and layer-wise routing normalization, advancing MoE scalability and efficiency.

Findings

1. DTop-p outperforms both Top-k and fixed-threshold Top-p baselines.

2. It maintains precise control over the number of activated experts across tokens and layers.

3. It scales well with model size and dataset complexity.

Abstract

Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) Controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization…
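The control loop described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the gains `kp` and `ki`, the positional PI update, the clipping bounds, the per-step (rather than running-average) measurement, and all names (`top_p_mask`, `PIThresholdController`, `simulate`) are assumptions made for illustration. Router logits are random stand-ins, and sparsity is expressed as the mean number of active experts per token.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def top_p_mask(probs, p):
    """Per token, keep the smallest set of top experts whose cumulative
    routing probability reaches p. Returns a boolean (tokens, experts) mask."""
    order = np.argsort(-probs, axis=-1)                 # experts by descending prob
    sorted_probs = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_probs, axis=-1)
    # Keep an expert while the mass accumulated *before* it is still below p,
    # so the top expert is always kept whenever p > 0.
    keep_sorted = (cum - sorted_probs) < p
    mask = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask


class PIThresholdController:
    """Hypothetical PI controller that nudges the Top-p threshold toward a
    target mean number of active experts per token (gains are assumed)."""

    def __init__(self, target_active, p_init=0.5, kp=0.02, ki=0.01,
                 p_min=0.01, p_max=1.0):
        self.target_active = target_active
        self.p_init = p_init
        self.kp, self.ki = kp, ki
        self.p_min, self.p_max = p_min, p_max
        self.integral = 0.0
        self.p = p_init

    def update(self, observed_active):
        # Positive error means too few experts are active, so p is raised.
        error = self.target_active - observed_active
        self.integral += error
        raw = self.p_init + self.kp * error + self.ki * self.integral
        self.p = float(np.clip(raw, self.p_min, self.p_max))
        return self.p


def simulate(steps=300, tokens=512, experts=8, target_active=2.0, seed=0):
    """Run the controller against random router logits and record the
    mean number of active experts per token at each step."""
    rng = np.random.default_rng(seed)
    ctrl = PIThresholdController(target_active)
    history = []
    for _ in range(steps):
        probs = softmax(rng.normal(size=(tokens, experts)))
        active = top_p_mask(probs, ctrl.p).sum(axis=-1).mean()
        history.append(float(active))
        ctrl.update(active)  # no gradient flows through p; it is set by feedback
    return history, ctrl.p


if __name__ == "__main__":
    history, p = simulate()
    print(f"first step: {history[0]:.2f} experts/token")
    print(f"last 50 steps: {np.mean(history[-50:]):.2f} experts/token, p = {p:.3f}")
```

The threshold is never differentiated in this sketch; it is adjusted purely by feedback on the observed activation count, which is the idea the abstract gives for sidestepping a non-differentiable threshold. Individual tokens still activate different numbers of experts, while the average is steered toward the target.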

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Advanced Graph Neural Networks