STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning

Jaeseong Lee; seung-won hwang; Aurick Qiao; Daniel F Campos; Zhewei Yao; Yuxiong He

arXiv:2409.06211·cs.LG·July 22, 2025

TL;DR

This paper shows that, for mixture-of-experts models, applying expert pruning (a form of structured pruning) before unstructured pruning can outperform unstructured-only pruning, and it introduces a scalable expert pruning method that reduces serving cost while largely maintaining model performance.

Contribution

Proposes a scalable expert pruning approach that leverages latent structure and, when followed by unstructured pruning, outperforms unstructured-only pruning on MoE large language models.

Findings

01. Achieves 40% sparsity with minimal performance loss on the Snowflake Arctic model.

02. Requires only one H100 GPU and two hours to prune large models.

03. Outperforms state-of-the-art unstructured pruning on generative tasks such as GSM8K.

Abstract

Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in large language models (LLMs). Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this by pruning MoEs. Among pruning methodologies, unstructured pruning has been known to achieve the highest performance for a given pruning ratio, compared to structured pruning, since the latter imposes constraints on the sparsification structure. This is intuitive, as the solution space of unstructured pruning subsumes that of structured pruning. However, our counterintuitive finding reveals that expert pruning, a form of structured pruning, can actually precede unstructured pruning to outperform unstructured-only pruning. As existing expert pruning, requiring $O(\frac{k^{n}}{n})$ forward passes for $n$ …
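The structured-then-unstructured ordering described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the expert-selection criterion (L1 norm), the global magnitude pruning step, and all function names (`prune_experts`, `magnitude_prune`, `overall_sparsity`) are assumptions made purely for illustration, and the "experts" are small random matrices rather than real MoE layers. It only shows how dropping whole experts first and then zeroing individual weights combine into one overall sparsity figure.

```python
import random


def l1_norm(weights):
    """Sum of absolute values of one expert's weight matrix."""
    return sum(abs(x) for row in weights for x in row)


def prune_experts(experts, keep):
    """Structured step: drop whole experts.

    Keeps the `keep` experts with the largest L1 norm. The criterion is a
    placeholder assumption, not the selection rule used in the paper.
    """
    ranked = sorted(range(len(experts)),
                    key=lambda i: l1_norm(experts[i]), reverse=True)
    kept_ids = sorted(ranked[:keep])
    # Copy rows so the caller's matrices are not modified.
    return [[row[:] for row in experts[i]] for i in kept_ids], kept_ids


def magnitude_prune(experts, sparsity):
    """Unstructured step: zero the smallest-magnitude weights.

    Ranks every weight of the remaining experts globally and zeroes the
    requested fraction, regardless of which expert or row it belongs to.
    """
    entries = sorted((abs(x), e, r, c)
                     for e, w in enumerate(experts)
                     for r, row in enumerate(w)
                     for c, x in enumerate(row))
    n_zero = round(len(entries) * sparsity)
    pruned = [[row[:] for row in w] for w in experts]
    for _, e, r, c in entries[:n_zero]:
        pruned[e][r][c] = 0.0
    return pruned


def overall_sparsity(original, pruned):
    """Fraction of the original weights that are removed or zeroed."""
    total = sum(len(row) for w in original for row in w)
    nonzero = sum(1 for w in pruned for row in w for x in row if x != 0.0)
    return 1 - nonzero / total


# Toy setup: 8 experts, each a 4x5 matrix of random weights.
rng = random.Random(0)
experts = [[[rng.gauss(0, 1) for _ in range(5)] for _ in range(4)]
           for _ in range(8)]

# Structured first (keep 6 of 8 experts), then unstructured (20% of the rest).
kept_experts, kept_ids = prune_experts(experts, keep=6)
pruned = magnitude_prune(kept_experts, sparsity=0.2)
print(kept_ids, overall_sparsity(experts, pruned))
```

In this toy configuration the two steps compound: removing a quarter of the experts and then zeroing a fifth of what remains leaves 0.75 × 0.8 = 60% of the original weights, which corresponds to an overall sparsity of 40%.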

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Robotics and Automated Systems

Methods: Pruning · Mixture of Experts