Does a Global Perspective Help Prune Sparse MoEs Elegantly?
Zeliang Zhang, Nikhil Ghosh, Jiani Liu, Bin Yu, Xiaodong Liu

TL;DR
This paper introduces GRAPE, a global redundancy-aware pruning method for sparse MoEs that dynamically allocates pruning budgets across layers, leading to improved model performance and efficiency.
Contribution
The paper presents a novel global pruning strategy for sparse MoEs that considers cross-layer redundancy, outperforming traditional uniform pruning methods.
Findings
GRAPE achieves up to 2.45% higher accuracy than local baselines.
It consistently outperforms existing pruning strategies across multiple models.
It improves efficiency by reducing memory consumption without sacrificing performance.
Abstract
Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average…
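To make the contrast between uniform and global budget allocation concrete, here is a minimal Python sketch. It is not GRAPE's actual algorithm, which the abstract does not spell out. It assumes each expert already has a scalar redundancy score (higher means more redundant), and the names `uniform_allocation`, `global_allocation`, and `min_keep` are hypothetical, chosen only for illustration.

```python
from typing import List


def uniform_allocation(scores: List[List[float]], total_budget: int) -> List[int]:
    """Local baseline: prune the same number of experts from every layer."""
    num_layers = len(scores)
    if total_budget % num_layers != 0:
        raise ValueError("uniform allocation needs a budget divisible by the layer count")
    return [total_budget // num_layers] * num_layers


def global_allocation(
    scores: List[List[float]], total_budget: int, min_keep: int = 1
) -> List[int]:
    """Illustrative global allocation (an assumption, not the paper's method).

    Ranks all experts across all layers by redundancy score and greedily
    prunes the most redundant ones first, while keeping at least `min_keep`
    experts in every layer. Returns the number of experts pruned per layer.
    """
    # Flatten to (score, layer) pairs and sort by descending redundancy.
    candidates = [(s, layer) for layer, row in enumerate(scores) for s in row]
    candidates.sort(key=lambda c: -c[0])

    pruned = [0] * len(scores)
    remaining = total_budget
    for _, layer in candidates:
        if remaining == 0:
            break
        # Skip this expert if pruning it would leave the layer too small.
        if len(scores[layer]) - pruned[layer] > min_keep:
            pruned[layer] += 1
            remaining -= 1
    if remaining > 0:
        raise ValueError("budget cannot be met under the min_keep constraint")
    return pruned


# Hypothetical redundancy scores: 3 layers, 4 experts each.
example_scores = [
    [0.9, 0.8, 0.7, 0.1],   # layer 0: highly redundant
    [0.2, 0.1, 0.3, 0.05],  # layer 1: little redundancy
    [0.6, 0.5, 0.1, 0.2],   # layer 2: moderate redundancy
]
uniform_budget = uniform_allocation(example_scores, total_budget=6)
global_budget = global_allocation(example_scores, total_budget=6)
```

With these made-up scores, both allocations prune six experts in total, but the global variant removes more experts from the highly redundant layer and fewer from the layer with little redundancy. How redundancy is actually measured and how the budget is distributed in GRAPE is defined in the paper itself.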
