Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques
Shwai He, Daize Dong, Liang Ding, Ang Li

TL;DR
This paper investigates various compression techniques for Mixture of Experts models, proposing aggressive pruning strategies like Layer Drop and Block Drop, combined with expert slimming, to significantly improve efficiency while maintaining high performance.
Contribution
The paper introduces two aggressive pruning methods, Layer Drop and Block Drop, and a comprehensive compression recipe for MoE models that improves scalability and efficiency while retaining over 92% of the original performance.
Findings
Achieved 6.05x speedup and 77.1% memory reduction
Maintained over 92% of original performance
Demonstrated effectiveness on Mixtral-8x7B
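To make the headline numbers concrete, the short sketch below converts them into the fraction of the original cost that remains. It assumes that "speedup" is the ratio of baseline runtime to compressed runtime and that "memory reduction" is measured relative to the uncompressed model; the summary does not state the measurement setup, and the function name `relative_cost` is purely illustrative.

```python
def relative_cost(speedup, memory_reduction_pct):
    """Convert a speedup factor and a memory reduction (in percent)
    into the fractions of runtime and memory that remain."""
    # Assumption: speedup = baseline_runtime / compressed_runtime
    latency_fraction = 1.0 / speedup
    # Assumption: reduction is relative to the uncompressed model
    memory_fraction = 1.0 - memory_reduction_pct / 100.0
    return latency_fraction, memory_fraction


latency, memory = relative_cost(6.05, 77.1)
print(f"{latency:.1%} of original runtime, {memory:.1%} of original memory")
# prints "16.5% of original runtime, 22.9% of original memory"
```

Under these assumptions, the compressed model would run in roughly one sixth of the original time and fit in under a quarter of the original memory.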
Abstract
Scaling large language models has driven remarkable advancements across various domains, yet the continual increase in model size presents significant challenges for real-world deployment. The Mixture of Experts (MoE) architecture offers a promising solution by dynamically selecting and activating only a subset of experts during inference, thus substantially reducing computational costs while preserving high performance. Despite these benefits, MoE introduces new inefficiencies, such as excessive parameters and communication overhead. In this work, we present a holistic study of compression techniques for Mixture of Experts to enhance both efficiency and scalability. While recent efforts have focused on Expert Trimming, which reduces the number of experts, these approaches still suffer from considerable communication and computational costs. To address this, we propose more aggressive…
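The "subset of experts" idea in the abstract can be illustrated with a minimal top-k routing sketch. This is a generic illustration of sparse MoE routing, not the authors' code: the function names, the toy experts, and the router logits are all hypothetical, and the choice of 8 experts with k=2 is an assumption that mirrors how Mixtral-8x7B is commonly described.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Rank experts by router score and keep only the top k
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Renormalize the scores over the selected experts only
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy setup (hypothetical): 8 experts, expert i scales its input by i + 1
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

routes = top_k_route(logits, k=2)   # experts 4 and 1 are selected
output = moe_forward(1.0, experts, logits, k=2)
```

Only 2 of the 8 experts are evaluated per input here, which is where the compute saving comes from. All 8 still have to be stored, and that is the parameter overhead the abstract says compression aims to reduce.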
Taxonomy
Topics: Data Stream Mining Techniques
Methods: Mixture of Experts
