QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts
Pingzhi Li, Xiaolong Jin, Zhen Tan, Yu Cheng, Tianlong Chen

TL;DR
This paper investigates fine-grained, structure-aware post-training quantization for Mixture-of-Experts models, demonstrating improved performance and proposing data-driven bit allocation techniques across multiple tasks.
Contribution
It introduces a novel, structure-aware quantization approach for MoE models and develops data-driven methods for optimized bit allocation, achieving state-of-the-art results.
Findings
Fine-grained mixed-precision quantization improves MoE model performance.
Structure-aware quantization requires varying bits for different MoE components.
The proposed approach achieves 65.35% accuracy, surpassing the baseline.
Abstract
Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models. It increases the number of parameters while keeping FLOPs nearly constant during inference through sparse activation. Yet, it still suffers from significant memory overheads due to the vast parameter size, necessitating model compression techniques. Post-training quantization offers a powerful approach for model compression. Existing methods adopt a fixed quantization precision for the entire MoE model. This rigid setup can lead to suboptimal performance, without considering the inherent sparse structure. For example, MoE's sparse routing mechanism leads to different activation patterns, where shared experts are accessed by all tokens while token-conditioned experts are selectively activated. This activation disparity suggests different quantization requirements, with consistently…
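The activation disparity described in the abstract can be illustrated with a minimal sketch of frequency-based bit allocation. This is not the authors' method: the function names `allocate_bits` and `average_bits`, the 4-bit and 2-bit widths, and the activation counts are all hypothetical, chosen only to show how a mixed-precision assignment might give more bits to frequently activated experts while tracking the average bit budget.

```python
from typing import List


def allocate_bits(
    activation_counts: List[int],
    num_high: int,
    high_bits: int = 4,  # assumed width for frequently used experts
    low_bits: int = 2,   # assumed width for rarely used experts
) -> List[int]:
    """Assign `high_bits` to the `num_high` most activated experts, `low_bits` to the rest."""
    # Rank expert indices by activation count, most active first.
    ranked = sorted(
        range(len(activation_counts)),
        key=lambda i: activation_counts[i],
        reverse=True,
    )
    high = set(ranked[:num_high])
    return [high_bits if i in high else low_bits for i in range(len(activation_counts))]


def average_bits(bits: List[int], param_counts: List[int]) -> float:
    """Parameter-weighted average bit width across experts."""
    total = sum(param_counts)
    return sum(b * n for b, n in zip(bits, param_counts)) / total


# Hypothetical activation counts for 8 experts, e.g. gathered on calibration data.
counts = [120, 15, 980, 40, 310, 5, 770, 60]
bits = allocate_bits(counts, num_high=2)
avg = average_bits(bits, [1] * len(counts))  # equal-sized experts assumed
print(bits, avg)
```

With these made-up counts, the two most activated experts (indices 2 and 6) keep 4 bits and the rest drop to 2 bits. A real allocation would likely also weigh per-layer or per-block sensitivity rather than activation frequency alone.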
Code & Models
- 🤗 mratsim/MiniMax-M2.5-BF16-INT4-AWQ (model · 18k downloads · ♡ 38)
- 🤗 mratsim/MiniMax-M2.5-FP8-INT4-AWQ (model · 10.0k downloads · ♡ 19)
- 🤗 mratsim/GLM-4.5-Iceblink-106B-A12B-AWQ (model · 2 downloads)
- 🤗 mratsim/GLM-Steam-106B-A12B-v1-AWQ (model · 5 downloads)
- 🤗 mratsim/GLM-4.5-Iceblink-v2-106B-A12B-AWQ (model · 1 download)
- 🤗 mratsim/GLM-4.6-EXL3 (model · 7 downloads · ♡ 4)
- 🤗 mratsim/GLM-4.5-Iceblink-v2-106B-A12B-FP8 (model · 8 downloads · ♡ 1)
- 🤗 mratsim/GLM-Steam-106B-A12B-v1-FP8 (model · 2 downloads)
- 🤗 mratsim/GLM-4.5-Iceblink-106B-A12B-FP8 (model · 2 downloads · ♡ 1)
- 🤗 mratsim/GLM-4.7-EXL3 (model · 36 downloads · ♡ 20)
Taxonomy
Topics: Forecasting Techniques and Applications · Advanced Bandit Algorithms Research
Methods: Mixture of Experts
