Does a Global Perspective Help Prune Sparse MoEs Elegantly?

Zeliang Zhang; Nikhil Ghosh; Jiani Liu; Bin Yu; Xiaodong Liu

arXiv:2604.06542·cs.CL·April 9, 2026


TL;DR

This paper introduces GRAPE, a global redundancy-aware pruning method for sparse MoEs that allocates pruning budgets across layers dynamically rather than uniformly, improving accuracy under the same pruning budget.

Contribution

The paper presents a novel global pruning strategy for sparse MoEs that considers cross-layer redundancy, outperforming traditional uniform pruning methods.

Findings

01

GRAPE achieves up to 2.45% higher accuracy than local baselines.

02

It consistently outperforms existing pruning strategies across multiple models.

03

It improves efficiency by reducing memory consumption without sacrificing performance.

Abstract

Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average…
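
To make the contrast between uniform and global budget allocation concrete, here is a minimal Python sketch. It is not GRAPE's actual algorithm: the abstract does not say how redundancy is measured, so the per-expert `scores`, the `min_keep` floor, and both function names are hypothetical and exist only for illustration. The sketch assumes a higher score means a more redundant expert.

```python
from typing import Dict, List

# Hypothetical redundancy scores: layer index -> one score per expert.
# Higher means "more redundant" (an assumption, not taken from the paper).
Scores = Dict[int, List[float]]


def uniform_allocation(scores: Scores, budget: int) -> Dict[int, List[int]]:
    """Prune the same number of experts from every layer."""
    per_layer = budget // len(scores)
    return {
        layer: sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:per_layer]
        for layer, s in scores.items()
    }


def global_allocation(scores: Scores, budget: int, min_keep: int = 2) -> Dict[int, List[int]]:
    """Rank experts across all layers and prune the most redundant first.

    min_keep is an illustrative floor so that no layer is emptied.
    """
    candidates = sorted(
        ((s, layer, idx) for layer, layer_scores in scores.items()
         for idx, s in enumerate(layer_scores)),
        key=lambda t: t[0],
        reverse=True,
    )
    pruned: Dict[int, List[int]] = {layer: [] for layer in scores}
    remaining = budget
    for _, layer, idx in candidates:
        if remaining == 0:
            break
        # Skip this expert if pruning it would leave fewer than min_keep experts.
        if len(scores[layer]) - len(pruned[layer]) > min_keep:
            pruned[layer].append(idx)
            remaining -= 1
    return pruned


# Toy example: layer 0 is highly redundant, layer 1 is not.
example_scores: Scores = {0: [0.9, 0.8, 0.7, 0.1], 1: [0.2, 0.1, 0.3, 0.05]}
uniform_result = uniform_allocation(example_scores, budget=2)
global_result = global_allocation(example_scores, budget=2)
```

In this toy setting, the uniform scheme removes one expert from each layer, while the global scheme spends the whole budget on the more redundant layer. That is the kind of heterogeneity a cross-layer view can exploit, under the assumptions stated above.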
