Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques

Shwai He; Daize Dong; Liang Ding; Ang Li

arXiv:2406.02500 · cs.LG · March 18, 2025 · 3 cites

TL;DR

This paper investigates various compression techniques for Mixture of Experts models, proposing aggressive pruning strategies like Layer Drop and Block Drop, combined with expert slimming, to significantly improve efficiency while maintaining high performance.

Contribution

It introduces aggressive pruning methods (Layer Drop and Block Drop) and a comprehensive compression recipe for MoE models, improving scalability and efficiency while retaining over 92% of the original performance.

Findings

01 Achieved a 6.05x speedup and a 77.1% memory reduction

02 Maintained over 92% of original performance

03 Demonstrated effectiveness on Mixtral-8x7B

Abstract

Scaling large language models has driven remarkable advancements across various domains, yet the continual increase in model size presents significant challenges for real-world deployment. The Mixture of Experts (MoE) architecture offers a promising solution by dynamically selecting and activating only a subset of experts during inference, thus substantially reducing computational costs while preserving high performance. Despite these benefits, MoE introduces new inefficiencies, such as excessive parameters and communication overhead. In this work, we present a holistic study of compression techniques for Mixture of Experts to enhance both efficiency and scalability. While recent efforts have focused on Expert Trimming, which reduces the number of experts, these approaches still suffer from considerable communication and computational costs. To address this, we propose more aggressive…
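The sparse activation described in the abstract can be illustrated with a minimal sketch of top-k expert routing. This is not the authors' code: the function names (`top_k_route`, `moe_forward`), the toy scalar experts, the example logits, and the choice of k = 2 with a softmax over the selected logits are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Hypothetical helper: weights are a softmax over the selected logits only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum over only the k selected experts; the rest are never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy experts: scalar functions standing in for feed-forward blocks (assumption).
experts = [lambda x, s=s: s * x for s in range(1, 9)]  # 8 experts
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]   # made-up router scores

routes = top_k_route(logits, k=2)
output = moe_forward(3.0, experts, logits, k=2)
```

Only two of the eight toy experts are evaluated per input, which is where the compute saving comes from. All eight still have to be stored, which is the parameter overhead the abstract refers to.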

Code & Models

Repositories

daizedong/unified-moe-compression (PyTorch, official)

Taxonomy

Topics: Data Stream Mining Techniques

Methods: Mixture of Experts