KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

Zukang Xu; Zhixiong Zhao; Xing Hu; Zhixuan Chen; Dawei Yang

arXiv:2602.11184·cs.LG·February 25, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces KBVQ-MoE, a vector quantization framework for ultra-low-bit compression of MoE large language models. It targets two MoE-specific problems, redundant representations among experts and cumulative output bias, to make deployment in resource-constrained environments more practical.

Contribution

The paper proposes KBVQ-MoE, which combines KLT-guided SVD with bias correction to improve low-bit vector quantization of MoE LLMs, a combination the paper presents as unaddressed in prior work.

Findings

1. 3-bit quantization achieves nearly the same accuracy as FP16.
2. KBVQ-MoE outperforms existing quantization methods.
3. Enables efficient deployment on resource-constrained devices.

Abstract

Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation. However, their enormous parameter sizes and memory demands pose major challenges for deployment in resource-constrained environments. Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs) by leveraging a codebook, where weight vectors are mapped to the most similar discrete codewords. Yet, directly applying VQ to MoEs often leads to substantial performance degradation due to two critical obstacles: (1) redundant representations among experts cause VQ to repeatedly quantize similar representations for each expert, resulting in inefficient use of limited codebook capacity; and (2) cumulative output bias is amplified by expert aggregation in MoE layers,…
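The codebook mapping described in the abstract can be illustrated with a minimal sketch of plain vector quantization. This is not the KBVQ-MoE method itself: the helper names (`nearest_codeword`, `quantize`, `dequantize`, `bits_per_weight`), the toy codebook, and the toy weights are all hypothetical and chosen only for illustration. The sketch assumes squared Euclidean distance as the similarity measure and counts only the index cost, ignoring the storage cost of the codebook.

```python
import math


def nearest_codeword(vec, codebook):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize(weights, codebook, dim):
    """Split a flat weight list into dim-sized vectors and map each to a codeword index."""
    if len(weights) % dim != 0:
        raise ValueError("weight count must be a multiple of the vector dimension")
    vectors = [weights[i:i + dim] for i in range(0, len(weights), dim)]
    return [nearest_codeword(v, codebook) for v in vectors]


def dequantize(indices, codebook):
    """Rebuild an approximate flat weight list from codeword indices."""
    restored = []
    for idx in indices:
        restored.extend(codebook[idx])
    return restored


def bits_per_weight(codebook_size, dim):
    """Index cost only: log2(K) bits per vector, spread over dim weights."""
    return math.log2(codebook_size) / dim


# Toy data (hypothetical, not taken from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
weights = [0.9, 1.2, -0.8, 0.7, 0.1, -0.2, 1.1, -0.9]

indices = quantize(weights, codebook, dim=2)
restored = dequantize(indices, codebook)
print(indices)                                # [1, 2, 0, 3]
print(bits_per_weight(len(codebook), dim=2))  # 1.0
```

With 4 codewords of dimension 2, each pair of weights is stored as a 2-bit index, which works out to 1 bit per weight. Because the codebook is small, any capacity spent on near-duplicate vectors is capacity unavailable for others, which is the kind of inefficiency the abstract attributes to redundant expert representations.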

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The motivation is clear and convincing, highlighting MoE-specific issues of expert redundancy and output bias.
2. The proposed KBVQ-MoE is validated on several representative MoE architectures, demonstrating consistent improvements.

Weaknesses

1. While IDRE and BCOS are ablated individually, there is no fine-grained study of codebook size sensitivity.
2. Comparisons with more advanced or concurrent MoE-aware compression baselines are missing (e.g., D2-MoE and SubMoE, both mentioned in related work).
3. The evaluation of computational efficiency is insufficient. The paper reports only a simple “Decoder speed test” in Table 6, without a detailed analysis of computational or memory overhead.
4. The core motivation of the paper lies in the claim that redundanc

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper is easy to understand and well written.
2. The proposed technique is intuitive.

Weaknesses

1. The paper is missing comparisons with recent non-linear quantization baselines: VPTQ (https://arxiv.org/abs/2409.17066), AQLM (https://arxiv.org/pdf/2401.06118), QuIP (https://arxiv.org/pdf/2307.13304), QuIP# (https://arxiv.org/pdf/2402.04396), SqueezeLLM (https://arxiv.org/pdf/2306.07629), GPTVQ (https://arxiv.org/pdf/2402.15319), etc.
2. For the baselines that are presented, the compression achieved by each technique is not reported.
3. Iso-compression results are missing.
4. Evaluation on comple

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper makes a novel contribution by adapting vector quantization specifically for MoE architectures. The identification of expert redundancy and amplified quantization bias as key bottlenecks is insightful, and the KLT-guided SVD approach creatively aligns weight decomposition with input activation statistics.
2. The technical approach is sound with theoretical justifications provided in the appendices. The experimental evaluation is comprehensive, covering multiple MoE models with thorou

Weaknesses

1. The paper mentions "negligible" computational overhead but provides limited quantitative analysis. How long does the KLT-SVD calibration take compared to standard VQ? What is the actual inference-time cost of the channel-wise bias correction operations? These practical considerations matter for deployment. Could you provide some results on this, such as the time cost of the quantization method?
2. The baseline methods, especially MoEQuant, show surprisingly poor performance at 2-bit in Table 1 (e.g., W2

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques