MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE
Geng Zhang, Yuxuan Han, Yuxuan Lou, Yiqi Zhang, Wangbo Zhao, Yang You

TL;DR
MoNE introduces a novel expert pruning approach that replaces redundant experts with lightweight novices, significantly reducing memory costs while maintaining high model performance across various tasks.
Contribution
This paper presents MoNE, a new expert pruning method that evaluates redundancy and replaces experts with lightweight novices, improving robustness and efficiency in structured pruning of MoE models.
Findings
MoNE outperforms baseline methods with up to 2.72% accuracy gain.
Minimal performance drop of 0.14% on Qwen2-57B-A14B under 25% pruning.
Effective across multiple downstream tasks and model architectures.
Abstract
Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead because all experts must be kept in memory. While structured pruning is a promising way to reduce memory costs, existing methods often show suboptimal performance and unstable degradation along three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy based on two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices: unbiased estimations of…
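The selection step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The abstract does not say how frequency and variance are fused, so the product `frequency * variance` used here is an assumption, and the names `expert_stats` and `plan_pruning` are hypothetical. The novice is taken to be the mean of the expert's outputs on calibration tokens, as the reviews describe it.

```python
from statistics import fmean


def expert_stats(outputs, dim):
    """Return (mean output vector, total output variance) for one expert.

    `outputs` holds the vectors the expert produced on the calibration
    tokens routed to it. An expert that received no tokens gets a zero
    mean and zero variance.
    """
    if not outputs:
        return [0.0] * dim, 0.0
    mean = [fmean(vec[d] for vec in outputs) for d in range(dim)]
    variance = fmean(
        sum((vec[d] - mean[d]) ** 2 for d in range(dim)) for vec in outputs
    )
    return mean, variance


def plan_pruning(routed_outputs, dim, num_prune):
    """Pick the `num_prune` most redundant experts and build their novices.

    ASSUMPTION: redundancy is scored as frequency * variance (lower means
    more redundant). The source only says that low usage and stable
    outputs mark an expert as redundant.
    """
    total = sum(len(outputs) for outputs in routed_outputs.values())
    scores, means = {}, {}
    for expert_id, outputs in routed_outputs.items():
        mean, variance = expert_stats(outputs, dim)
        frequency = len(outputs) / total if total else 0.0
        scores[expert_id] = frequency * variance
        means[expert_id] = mean
    pruned = sorted(scores, key=lambda e: (scores[e], e))[:num_prune]
    # Each pruned expert is replaced by a constant vector: its mean output.
    return {e: means[e] for e in pruned}


# Toy calibration data: expert 1 is rarely used and nearly constant.
calibration = {
    0: [[1.0, 0.0], [-1.0, 2.0], [3.0, -2.0], [0.0, 1.0]],
    1: [[0.5, 0.5], [0.5, 1.5]],
    2: [[2.0, 2.0], [0.0, 0.0], [4.0, 4.0]],
}
plan = plan_pruning(calibration, dim=2, num_prune=1)
```

In this toy setup `plan` maps the single pruned expert to its novice vector. A real pipeline would collect the outputs by running calibration text through the model, one MoE layer at a time.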
Peer Reviews
Decision: ICLR 2026 Poster
1. The core idea is intuitive, simple, and training-free. The fused metric (frequency + variance) is well-justified, and the "novice" replacement (the expert's mean output) is an effective closed-form solution to minimize output discrepancy.
2. The experimental validation is a major strength. Testing on five different MoE architectures with varying sizes (7B to 57B parameters) demonstrates the method works across scales. The robustness evaluation across model architectures, calibration data sou
1. There is a lack of specialized tasks (e.g., coding, math) in evaluation. It's unclear if the redundancy metric, calibrated on general text, might inadvertently prune experts that are critical for these specialized capabilities.
2. The paper doesn't explain or ablate the benefit of computing a dynamic, per-token gate for a static, constant "novice" vector. This appears computationally redundant.
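To make the reviewer's second point concrete, the sketch below shows a single-token MoE forward pass in which a pruned expert is replaced by a constant novice but still receives its per-token gate. It is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, gates are a plain softmax without top-k renormalization, and the output dimension is assumed equal to the input dimension.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router, experts, novices, top_k):
    """Combine the top-k experts for one token vector `x`.

    `experts` maps expert id -> callable; `novices` maps the id of a
    pruned expert -> constant vector. The router is left untouched, so a
    novice is still weighted by a gate that depends on the token.
    ASSUMPTION: gates are not renormalized over the selected experts.
    """
    gates = softmax(router(x))
    chosen = sorted(range(len(gates)), key=lambda i: -gates[i])[:top_k]
    out = [0.0] * len(x)  # assumes output dim == input dim
    for i in chosen:
        # A novice costs one vector read instead of a full expert call.
        y = novices[i] if i in novices else experts[i](x)
        out = [acc + gates[i] * v for acc, v in zip(out, y)]
    return out


# Toy layer with three experts; expert 1 has been replaced by a novice.
router = lambda x: [0.0, 0.0, -100.0]
experts = {0: lambda x: [2.0 * v for v in x], 2: lambda x: [0.0 for _ in x]}
novices = {1: [1.0, 1.0]}
y = moe_forward([1.0, 2.0], router, experts, novices, top_k=2)
```

The novice's contribution is `gates[i] * novice`, so its direction is fixed while its magnitude still varies per token. Whether that per-token scaling helps is the ablation the reviewer asks for.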
* Simple, compute-friendly pruning primitive (constant novices) that retains router behavior and keeps overhead close to removal.
* Consistent gains/robustness across models and calibration setups; headline numbers are competitive.
* The ablation in Figure 4 is interesting: it seems that the variance metric can bring improvement without the novice. It would be helpful to include a more comprehensive ablation, e.g., more combinations (such as only frequency and only variance), to show the gain from each part.
* The novice is the unbiased mean output of a pruned expert (a constant vector), similar to FLAP’s use of averaged activations for compensation but at a different granularity. The paper should more explicitly discuss the relatio
1. This paper proposes a novel expert pruning method named MoNE, which replaces redundant experts with lightweight novices to compress MoE models with minimal performance loss.
2. This paper uses expert access frequency and output variance to measure redundancy, and unbiased output estimation to minimize post-pruning discrepancy, yielding effective and robust pruning.
1. The combinatorial forms of frequency and variance adjacency matrices require ablation, such as weighted summation.
2. Replacing experts with constant vectors may reduce expressiveness; could learnable vectors or biases be used instead?
3. Could comparative experiments on pruning strategies (without finetuning) be provided to demonstrate the superiority of the proposed frequency- and variance-based pruning strategy?
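Several reviews describe the novice as the mean expert output and as a closed-form minimizer of output discrepancy. A short derivation shows why the mean is the best constant, assuming the discrepancy is measured as expected squared L2 error. The source does not state the exact objective, so that choice is an assumption, as are the symbols. Here y is the output of the pruned expert on the tokens routed to it, \mu = \mathbb{E}[y], and c is any constant replacement vector.

```latex
\begin{aligned}
\mathbb{E}\,\lVert y - c \rVert_2^2
  &= \mathbb{E}\,\lVert (y - \mu) + (\mu - c) \rVert_2^2 \\
  &= \mathbb{E}\,\lVert y - \mu \rVert_2^2
     + 2\,\mathbb{E}[\,y - \mu\,]^{\top} (\mu - c)
     + \lVert \mu - c \rVert_2^2 \\
  &= \operatorname{tr}\operatorname{Cov}(y) + \lVert \mu - c \rVert_2^2
  \qquad \text{since } \mathbb{E}[\,y - \mu\,] = 0 .
\end{aligned}
\qquad\Longrightarrow\qquad
\arg\min_{c}\; \mathbb{E}\,\lVert y - c \rVert_2^2 = \mu
```

Under this assumption the error left after replacement equals the total output variance, tr Cov(y). This is consistent with treating low output variance as a sign of redundancy. It also means a constant vector, learned or not, cannot reduce this particular objective below that value, although a learned vector could still help on a different objective such as end-task loss.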
Code & Models
- 🤗 MoNE-Pruning/DeepSeek-V2-Lite-MoNE-48-c4-1000 (model, 38 downloads)
- 🤗 MoNE-Pruning/DeepSeek-V2-Lite-MoNE-48-zyda2-1000 (model, 5 downloads)
- 🤗 MoNE-Pruning/DeepSeek-V2-Lite-MoNE-48-zyda2-100 (model, 4 downloads)
- 🤗 MoNE-Pruning/Qwen2-57B-A14B-MoNE-48-zyda2-100 (model, 3 downloads)
- 🤗 MoNE-Pruning/Qwen2-57B-A14B-Instruct-MoNE-48-math-100 (model, 25 downloads)
- 🤗 MoNE-Pruning/Qwen2-57B-A14B-Instruct-MoNE-48-gsm8k-100 (model, 25 downloads)
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Mobile Crowdsensing and Crowdsourcing
