Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
Haoze He, Xingyuan Ding, Xuan Jiang, Xinkai Zou, Alex Cheng, Yibo Zhao, Juncheng Billy Li, and Heather Miller

TL;DR
This paper introduces an MoE fine-tuning method that preserves long-tailed expert information by combining bias-driven sparsification with always-active experts, improving performance on reasoning benchmarks.
Contribution
It proposes an auxiliary-loss-free MoE fine-tuning framework that maintains long-tailed expert knowledge without noisy gradients, outperforming existing methods.
Findings
Outperforms DenseMixer and ESFT baselines by 2.5%+ on reasoning benchmarks.
Preserves long-tailed expert information effectively under sparse routing.
Enhances expert activation stability and knowledge consolidation.
Abstract
Although MoE models lead many benchmarks, supervised fine-tuning (SFT) of MoE architectures remains difficult because their router layers are fragile. Methods such as DenseMixer and ESFT mitigate router collapse with dense mixing or auxiliary load-balancing losses, but these introduce noisy gradients that often degrade performance. In preliminary experiments, we systematically pruned experts and observed that while certain super experts are activated far more frequently, discarding less-used experts still leads to notable performance degradation. This suggests that even rarely activated experts encode non-trivial knowledge useful for downstream tasks. Motivated by this, we propose an auxiliary-loss-free MoE SFT framework that combines bias-driven sparsification with always-active gated condenser experts. Rather than enforcing balanced activation across all experts, our method…
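To make the two ingredients named in the abstract concrete, here is a minimal sketch of one way a per-expert bias and an always-active gated expert could fit into top-k routing for a single token. The abstract is truncated before it describes the method, so everything here is an assumption for illustration: the names `route_token`, `moe_forward`, and `condenser_logit`, the choice to apply the bias to expert selection only, the softmax over raw scores, and the sigmoid gate are all hypothetical and are not taken from the paper.

```python
import math

def route_token(scores, bias, k):
    """Select k routed experts by (score + bias); weight them by a softmax of raw scores.

    Assumption: the bias shifts which experts are selected but does not enter
    the mixing weights. The paper may define this differently.
    """
    adjusted = [s + b for s, b in zip(scores, bias)]
    top = sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)[:k]
    m = max(scores[i] for i in top)                 # subtract max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}      # expert index -> mixing weight

def moe_forward(x, scores, bias, k, experts, condenser, condenser_logit):
    """Combine the selected routed experts with one always-active gated expert.

    Assumption: the always-active ("condenser") expert is scaled by a sigmoid gate.
    """
    weights = route_token(scores, bias, k)
    routed = sum(w * experts[i](x) for i, w in weights.items())
    gate = 1.0 / (1.0 + math.exp(-condenser_logit))
    return routed + gate * condenser(x), weights

# Toy usage: four scalar "experts" and one condenser expert (all hypothetical).
experts = [lambda x, a=a: a * x for a in (1.0, 2.0, 3.0, 4.0)]
scores = [2.0, 1.0, 0.5, 0.1]
out, weights = moe_forward(1.0, scores, [0.0] * 4, 2, experts, lambda x: 10.0 * x, 0.0)
print(sorted(weights), out)
```

In this sketch, a strongly negative bias on an expert removes it from the selected set without changing the router scores themselves, while the condenser expert contributes to every token regardless of routing.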
