EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
Yuanteng Chen, Yuantian Shao, Peisong Wang, Jian Cheng

TL;DR
EAC-MoE compresses Mixture-of-Experts (MoE) large language models in two ways: it calibrates the routers to counter the expert-selection bias introduced by low-bit quantization, and it prunes less-used experts. Together these reduce GPU memory usage and accelerate inference.
Contribution
The paper presents EAC-MoE, an expert-selection aware compressor that combines quantization with router calibration (QESC) and expert pruning to improve the efficiency of MoE-LLMs.
Findings
Reduces GPU memory consumption significantly.
Improves inference speed with minimal performance loss.
Effectively calibrates expert selection bias in MoE models.
Abstract
Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) low activated parameters cannot be equivalently translated into inference acceleration effects. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which deeply aligns with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) The expert selection bias caused by low-bit quantization is a major factor contributing to the performance degradation in MoE-LLMs. Based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert selection bias by calibrating the routers within the MoE; (2) There are always certain experts that are…
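The expert-selection bias mentioned in the abstract can be illustrated with a small toy experiment. The sketch below is not the paper's QESC procedure; it only measures how often a top-k router picks a different set of experts once its input has been perturbed by low-bit quantization. Everything in it is a hypothetical stand-in: the helper names (`top_k_experts`, `router_logits`, `fake_quantize`, `selection_mismatch_rate`), the random linear router, and the choice to quantize the hidden state directly as a proxy for the error that quantized upstream weights would introduce.

```python
import random


def top_k_experts(logits, k):
    """Return the set of indices of the k largest router logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return frozenset(ranked[:k])


def router_logits(weights, hidden):
    """Toy linear router: one logit per expert (one row of `weights` each)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def fake_quantize(vec, bits):
    """Symmetric round-to-nearest quantization onto a uniform grid.

    Assumption: this stands in for quantization error reaching the router;
    it is not the quantizer used in the paper.
    """
    if bits < 2:
        raise ValueError("bits must be >= 2 for a symmetric grid")
    levels = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in vec) / levels) or 1.0
    return [round(v / scale) * scale for v in vec]


def selection_mismatch_rate(weights, tokens, k, bits):
    """Fraction of tokens whose top-k expert set changes after quantization."""
    changed = 0
    for hidden in tokens:
        reference = top_k_experts(router_logits(weights, hidden), k)
        perturbed = top_k_experts(
            router_logits(weights, fake_quantize(hidden, bits)), k
        )
        changed += reference != perturbed
    return changed / len(tokens)


# Hypothetical sizes: 8 experts, 16-dimensional hidden states, 200 tokens.
rng = random.Random(0)
dim, n_experts = 16, 8
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]

for bits in (8, 4, 2):
    rate = selection_mismatch_rate(weights, tokens, k=2, bits=bits)
    print(f"{bits}-bit: {rate:.1%} of tokens routed to a different expert set")
```

A mismatch rate like this is one simple way to quantify how far routing drifts from the full-precision model; a calibration step in the spirit of the abstract would aim to bring the quantized model's expert choices back toward the full-precision ones.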
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications
