QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts

Pingzhi Li; Xiaolong Jin; Zhen Tan; Yu Cheng; Tianlong Chen

arXiv:2406.08155 · cs.LG · February 26, 2025 · 1 citation

TL;DR

This paper investigates fine-grained, structure-aware post-training quantization for Mixture-of-Experts models, demonstrating improved performance and proposing data-driven bit allocation techniques across multiple tasks.

Contribution

It introduces a novel, structure-aware quantization approach for MoE models and develops data-driven methods for optimized bit allocation, achieving state-of-the-art results.

Findings

01

Fine-grained mixed-precision quantization improves MoE model performance.

02

Structure-aware quantization requires different bit widths for different MoE components.

03

The approach achieves 65.35% accuracy, surpassing baseline performance.

Abstract

Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models. It increases the number of parameters while keeping FLOPs nearly constant during inference through sparse activation. Yet, it still suffers from significant memory overheads due to the vast parameter size, necessitating model compression techniques. Post-training quantization offers a powerful approach for model compression. Existing methods adopt a fixed quantization precision for the entire MoE model. This rigid setup can lead to suboptimal performance, without considering the inherent sparse structure. For example, MoE's sparse routing mechanism leads to different activation patterns, where shared experts are accessed by all tokens while token-conditioned experts are selectively activated. This activation disparity suggests different quantization requirements, with consistently…
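To make the contrast between a fixed precision and a structure-aware one concrete, here is a minimal sketch in Python. It is not the paper's method: the quantizer is plain symmetric round-to-nearest, the weights are random, and the names `quantize_dequantize`, `average_bits`, `weights`, and `bit_plan` are hypothetical. The choice of 4 bits for the shared expert and 2 bits for the routed experts is an illustrative assumption, not an allocation reported by the authors.

```python
import numpy as np


def quantize_dequantize(w, bits):
    """Symmetric per-tensor round-to-nearest quantization, returned as floats."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for 4 bits, 1 for 2 bits
    scale = np.abs(w).max() / qmax       # one scale for the whole tensor
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale


def average_bits(sizes, bits):
    """Parameter-weighted average bit width over all components."""
    total = sum(sizes.values())
    return sum(sizes[name] * bits[name] for name in sizes) / total


# Toy stand-ins for MoE weight matrices (random values, assumed shapes).
rng = np.random.default_rng(0)
weights = {
    "shared_expert": rng.standard_normal((64, 64)),
    "routed_expert_0": rng.standard_normal((64, 64)),
    "routed_expert_1": rng.standard_normal((64, 64)),
}

# Hypothetical mixed-precision plan: more bits for the always-active component.
bit_plan = {"shared_expert": 4, "routed_expert_0": 2, "routed_expert_1": 2}

# Mean squared reconstruction error per component under the plan.
errors = {
    name: float(np.mean((w - quantize_dequantize(w, bit_plan[name])) ** 2))
    for name, w in weights.items()
}

avg_bits = average_bits({n: w.size for n, w in weights.items()}, bit_plan)

for name in weights:
    print(f"{name}: {bit_plan[name]} bits, MSE = {errors[name]:.4f}")
print(f"average bits per parameter: {avg_bits:.3f}")
```

The point of the sketch is only the bookkeeping: each component gets its own bit width, and the plan is judged by per-component error against the parameter-weighted average bit budget. A data-driven allocation would replace the hand-written `bit_plan` with one derived from calibration statistics.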

Code & Models

Repositories

unites-lab/moe-quantization
PyTorch · Official

Taxonomy

Topics: Forecasting Techniques and Applications · Advanced Bandit Algorithms Research

Methods: Mixture of Experts