MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance

Xing Hu; Zhixuan Chen; Dawei Yang; Zukang Xu; Chen Xu; Zhihang Yuan; Sifan Zhou; Jiangyong Yu

arXiv:2505.03804·cs.LG·May 8, 2025

TL;DR

MoEQuant introduces expert-balanced sampling and affinity-guided quantization to improve the accuracy and efficiency of quantizing mixture-of-experts large language models by addressing inter-expert and intra-expert imbalance.

Contribution

This paper presents MoEQuant, a novel quantization framework specifically designed for MoE LLMs, with techniques to balance expert sampling and incorporate affinity information.

Findings

01. Achieves an accuracy gain of more than 10 points on HumanEval for DeepSeekMoE-16B under 4-bit quantization.

02. Addresses both inter-expert and intra-expert imbalance in MoE quantization.

03. Enhances model efficiency while maintaining high performance.

Abstract

Mixture-of-Experts (MoE) large language models (LLMs), which leverage dynamic routing and sparse activation to enhance efficiency and scalability, have achieved higher performance while reducing computational costs. However, these models face significant memory overheads, limiting their practical deployment and broader adoption. Post-training quantization (PTQ), a widely used method for compressing LLMs, encounters severe accuracy degradation and diminished generalization performance when applied to MoE models. This paper investigates the impact of MoE's sparse and dynamic characteristics on quantization and identifies two primary challenges: (1) Inter-expert imbalance, referring to the uneven distribution of samples across experts, which leads to insufficient and biased calibration for less frequently utilized experts; (2) Intra-expert imbalance, arising from MoE's unique aggregation…
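
The inter-expert imbalance described in the abstract can be made concrete with a small sketch. The Python below is not MoEQuant's method; it only illustrates, on synthetic data, how top-k routing can spread calibration tokens unevenly across experts. The function names (`topk_expert_counts`, `imbalance_ratio`, `make_toy_logits`), the linear per-expert bias, and the choice of 8 experts with k = 2 are all assumptions made for illustration.

```python
import random
from collections import Counter


def topk_expert_counts(router_logits, k):
    """Count how many tokens reach each expert under top-k routing.

    router_logits: list of per-token score lists, one score per expert.
    """
    counts = Counter()
    for scores in router_logits:
        # Indices of the k highest-scoring experts for this token.
        top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        counts.update(top)
    return counts


def imbalance_ratio(counts, num_experts):
    """Ratio of the busiest expert's token count to the least busy one's."""
    loads = [counts.get(e, 0) for e in range(num_experts)]
    least = min(loads)
    return float("inf") if least == 0 else max(loads) / least


def make_toy_logits(num_tokens, num_experts, seed=0):
    """Synthetic router scores (hypothetical): higher-index experts get a larger bias."""
    rng = random.Random(seed)
    bias = [0.5 * e for e in range(num_experts)]
    return [
        [rng.gauss(bias[e], 1.0) for e in range(num_experts)]
        for _ in range(num_tokens)
    ]


# Toy calibration set: 1000 tokens, 8 experts, top-2 routing (all assumed values).
logits = make_toy_logits(num_tokens=1000, num_experts=8)
counts = topk_expert_counts(logits, k=2)
print("tokens per expert:", [counts.get(e, 0) for e in range(8)])
print("max/min load ratio:", imbalance_ratio(counts, 8))
```

Experts that receive few tokens in such a tally would see little calibration data, which is the situation the abstract identifies as leading to insufficient and biased calibration.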

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Data and IoT Technologies