MoQE: Improve Quantization Model performance via Mixture of Quantization Experts
Jinhao Zhang, Yunquan Zhang, Boyang Zhang, Zeyu Liu, Daning Cheng

TL;DR
MoQE applies a Mixture-of-Experts framework to quantization: it dynamically routes each input to a specialized quantization expert, improving the accuracy of quantized models on resource-constrained devices and reaching performance comparable to state-of-the-art quantization models with minimal latency increase.
Contribution
The paper presents MoQE, a quantization inference framework built on the MoE architecture that reduces the accuracy degradation of quantized models.
Findings
MoQE achieves performance comparable to SOTA quantization models.
It effectively reduces accuracy loss in quantized models.
The framework maintains low inference latency.
Abstract
Quantization plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture, aiming to jointly improve the performance of quantized models. MoQE combines multiple quantization variants of one full-precision model as specialized "quantization experts" and dynamically routes input data to the most suitable expert based on its characteristics. Through these specialized quantization experts, MoQE alleviates the performance degradation commonly seen in single quantization models. We design lightweight, structure-aware router models…
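The routing idea described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the "model" is a single linear layer, the experts are symmetric uniform quantizations of its weights at different bit-widths, and the router is an untrained linear scorer with top-1 selection standing in for the paper's lightweight, structure-aware routers. The names `quantize_weights`, `make_expert`, `route`, and `moqe_predict` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def quantize_weights(weights: Sequence[float], bits: int) -> List[float]:
    """Symmetric uniform quantization, returned in dequantized (float) form."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights) or 1.0     # avoid a zero scale
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def make_expert(weights: Sequence[float], bits: int) -> Callable[[Sequence[float]], float]:
    """Build one 'quantization expert': the same linear model at a given bit-width."""
    q = quantize_weights(weights, bits)

    def expert(x: Sequence[float]) -> float:
        return sum(wi * xi for wi, xi in zip(q, x))

    return expert


def route(x: Sequence[float], router_weights: Sequence[Sequence[float]]) -> int:
    """Toy router (assumption): one linear score per expert, pick the highest."""
    scores = [sum(r * xi for r, xi in zip(row, x)) for row in router_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def moqe_predict(
    x: Sequence[float],
    experts: Sequence[Callable[[Sequence[float]], float]],
    router_weights: Sequence[Sequence[float]],
) -> Tuple[int, float]:
    """Send the input to exactly one expert and return (expert index, output)."""
    idx = route(x, router_weights)
    return idx, experts[idx](x)


# Hypothetical full-precision weights and two quantized variants of them.
full_precision = [0.82, -0.33, 0.05, 0.47]
experts = [make_expert(full_precision, bits) for bits in (4, 8)]

# Untrained, hand-picked router weights, purely for demonstration.
router_weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

chosen, output = moqe_predict([1.0, 0.0, 0.0, 0.0], experts, router_weights)
```

Because only the selected expert runs for each input, the extra cost over a single quantized model in this sketch is the router's scoring step, which is one way to read the low-latency claim; the paper's actual router design and training are not reproduced here.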
Taxonomy
Topics: Neural Networks and Applications · Fault Detection and Control Systems · AI in cancer detection
