MoQE: Improve Quantization Model performance via Mixture of Quantization Experts

Jinhao Zhang; Yunquan Zhang; Boyang Zhang; Zeyu Liu; Daning Cheng

arXiv:2508.09204·cs.LG·September 30, 2025

Open Access

TL;DR

MoQE introduces a Mixture-of-Experts framework for quantization that improves model accuracy on resource-constrained devices by dynamically routing each input to a specialized quantization expert, achieving performance comparable to state-of-the-art quantization models with minimal latency increase.

Contribution

The paper presents MoQE, a quantization inference framework built on the Mixture-of-Experts (MoE) architecture that reduces the accuracy degradation of quantized models.

Findings

01. MoQE achieves comparable performance to SOTA quantization models.

02. It effectively reduces accuracy loss in quantized models.

03. The framework maintains low inference latency.

Abstract

Quantization plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture, aiming to jointly improve the performance of quantization models. MoQE combines multiple quantization variants of one full-precision model as specialized "quantization experts" and dynamically routes input data to the most suitable expert based on its characteristics. MoQE alleviates the performance degradation commonly seen in single quantization models through specialized quantization expert models. We design lightweight, structure-aware router models…
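The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the "model" here is a single linear layer, the experts are assumed to differ only in bit width, and the router is a hypothetical untrained linear scorer with hard top-1 selection. All function names (`quantize`, `route`, `moqe_predict`) are illustrative.

```python
import random


def quantize(weights, bits):
    """Symmetric uniform quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def linear(weights, x):
    """Dot product standing in for a full model forward pass."""
    return sum(w * xi for w, xi in zip(weights, x))


def route(router_weights, x):
    """Hard top-1 routing: return the index of the expert with the highest score."""
    scores = [linear(rw, x) for rw in router_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def moqe_predict(experts, router_weights, x):
    """Run only the selected quantization expert on the input."""
    idx = route(router_weights, x)
    return idx, linear(experts[idx], x)


random.seed(0)
dim = 8
full_precision = [random.uniform(-1, 1) for _ in range(dim)]

# Assumption: each expert is the same full-precision model quantized at a different bit width.
experts = [quantize(full_precision, bits) for bits in (4, 8)]

# Assumption: random router weights; a real router would be trained.
router_weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in experts]

x = [random.uniform(-1, 1) for _ in range(dim)]
expert_idx, y = moqe_predict(experts, router_weights, x)
print(f"routed to expert {expert_idx}, output {y:.4f}")
```

Because only one expert runs per input, the extra inference cost in this sketch is limited to the router's scoring step, which is the intuition behind the low-latency claim.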

Taxonomy

Topics: Neural Networks and Applications · Fault Detection and Control Systems · AI in cancer detection