STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He

TL;DR
This paper introduces a scalable expert pruning method (a form of structured pruning) for mixture-of-experts models and shows that applying it before unstructured pruning outperforms unstructured-only pruning, lowering serving cost while maintaining model performance.
Contribution
Proposes a scalable expert pruning approach that leverages latent structure and, when followed by unstructured pruning, outperforms unstructured-only pruning in large language models.
Findings
Achieves 40% sparsity with minimal performance loss on the Snowflake Arctic model.
Requires only one H100 GPU and two hours for pruning large models.
Outperforms state-of-the-art unstructured pruning on generative tasks such as GSM8K.
Abstract
Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in large language models (LLMs). Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this by pruning MoEs. Among pruning methodologies, unstructured pruning has been known to achieve the highest performance for a given pruning ratio, compared to structured pruning, since the latter imposes constraints on the sparsification structure. This is intuitive, as the solution space of unstructured pruning subsumes that of structured pruning. However, our counterintuitive finding reveals that expert pruning, a form of structured pruning, can actually precede unstructured pruning to outperform unstructured-only pruning. As existing expert pruning, requiring forward passes for …
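The structured-then-unstructured ordering described in the abstract can be illustrated with a toy NumPy sketch. This is not the paper's algorithm: the expert-selection rule (smallest Frobenius norm), the magnitude-based unstructured step, and all function names and tensor shapes below are assumptions chosen only to show how whole experts are removed first and individual weights are then zeroed to reach an overall sparsity target.

```python
import numpy as np

def prune_experts(experts, n_drop):
    """Structured step: drop the n_drop experts with the smallest Frobenius norm.
    The norm criterion is a stand-in assumption, not the paper's selection rule."""
    norms = np.linalg.norm(experts.reshape(experts.shape[0], -1), axis=1)
    keep = np.sort(np.argsort(norms)[n_drop:])
    return experts[keep], keep

def magnitude_prune(weights, sparsity):
    """Unstructured step: zero out the smallest-magnitude fraction of entries."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    flat = np.abs(weights).ravel()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def structured_then_unstructured(experts, n_drop, target_sparsity):
    """Remove whole experts first, then prune individual weights in the
    remaining experts until the overall target sparsity is reached."""
    total = experts.size
    kept, keep_idx = prune_experts(experts, n_drop)
    removed = total - kept.size
    # Fraction of the kept weights that still has to be zeroed.
    inner = max(target_sparsity * total - removed, 0.0) / kept.size
    pruned = magnitude_prune(kept, inner)
    overall = 1.0 - np.count_nonzero(pruned) / total
    return pruned, keep_idx, overall

# Hypothetical toy layer: 8 experts, each a 16x16 weight matrix.
rng = np.random.default_rng(0)
experts = rng.normal(size=(8, 16, 16))
pruned, keep_idx, overall = structured_then_unstructured(
    experts, n_drop=2, target_sparsity=0.4
)
print(pruned.shape, keep_idx, round(overall, 3))
```

In this toy setting, dropping 2 of 8 experts already accounts for 25% sparsity, so the unstructured step only needs to zero about 20% of the remaining weights to reach roughly 40% overall.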
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Robotics and Automated Systems
Methods: Pruning · Mixture of Experts
