KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang

TL;DR
This paper introduces KBVQ-MoE, a vector quantization framework that improves ultra-low-bit compression of MoE large language models by addressing expert redundancy and cumulative output bias, making deployment in resource-constrained environments more practical.
Contribution
The paper proposes KBVQ-MoE, which combines KLT-guided SVD with bias correction to improve low-bit vector quantization for MoE LLMs, targeting MoE-specific issues that prior work did not address.
Findings
3-bit quantization achieves nearly the same accuracy as FP16.
KBVQ-MoE outperforms existing quantization methods.
Enables efficient deployment on resource-constrained devices.
Abstract
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation. However, their enormous parameter sizes and memory demands pose major challenges for deployment in resource-constrained environments. Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs) by leveraging a codebook, where weight vectors are mapped to the most similar discrete codewords. Yet, directly applying VQ to MoEs often leads to substantial performance degradation due to two critical obstacles: (1) redundant representations among experts cause VQ to repeatedly quantize similar representations for each expert, resulting in inefficient use of limited codebook capacity; and (2) cumulative output bias is amplified by expert aggregation in MoE layers,…
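The codebook mapping described in the abstract can be illustrated with a minimal sketch of plain vector quantization. This is not the KBVQ-MoE method itself. The function names, the use of squared Euclidean distance as the similarity measure, and the bits-per-weight estimate (which ignores codebook storage) are all assumptions made for illustration.

```python
import numpy as np


def vq_quantize(weights, codebook):
    """Return, for each length-d sub-vector of `weights`, the index of its nearest codeword.

    Assumption: "most similar" means smallest squared Euclidean distance.
    `codebook` has shape (num_codewords, d); weights.size must be divisible by d.
    """
    d = codebook.shape[1]
    vectors = weights.reshape(-1, d)
    # Pairwise squared distances, shape (num_vectors, num_codewords).
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def vq_dequantize(indices, codebook, shape):
    """Rebuild an approximate weight tensor by looking up each stored index."""
    return codebook[indices].reshape(shape)


def bits_per_weight(num_codewords, vector_dim):
    """Index bits spent per weight, ignoring the cost of storing the codebook."""
    return float(np.log2(num_codewords)) / vector_dim
```

Under these assumptions, a codebook of 256 codewords over 4-dimensional vectors stores one 8-bit index per 4 weights, so `bits_per_weight(256, 4)` gives 2 bits per weight. Because every sub-vector competes for the same limited set of codewords, near-duplicate representations across experts use up codebook capacity, which is the first obstacle the abstract names.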
Peer Reviews
Decision: ICLR 2026 Poster
1. The motivation is clear and convincing, highlighting MoE-specific issues of expert redundancy and output bias.
2. The proposed KBVQ-MoE is validated on several representative MoE architectures, demonstrating consistent improvements.
1. While IDRE and BCOS are ablated individually, there is no fine-grained study of codebook size sensitivity.
2. The paper lacks more advanced or concurrent MoE-aware compression baselines (e.g., D2-MoE and SubMoE, both mentioned in related work).
3. The evaluation of computational efficiency is insufficient. The paper only reports a simple “Decoder speed test” in Table 6, without providing detailed analysis of computational or memory overhead.
4. The core motivation of the paper lies in the claim that redundanc
1. The paper is easy to understand and well written.
2. The proposed technique is intuitive.
1. The paper is missing comparisons with recent non-linear quantization baselines: VPTQ (https://arxiv.org/abs/2409.17066), AQLM (https://arxiv.org/pdf/2401.06118), QuIP (https://arxiv.org/pdf/2307.13304), QuIP# (https://arxiv.org/pdf/2402.04396), SqueezeLLM (https://arxiv.org/pdf/2306.07629), GPTVQ (https://arxiv.org/pdf/2402.15319), etc.
2. The compression achieved by each of the presented baselines is not reported.
3. Iso-compression results are missing.
4. Evaluation on comple
1. The paper makes a novel contribution by adapting vector quantization specifically for MoE architectures. The identification of expert redundancy and amplified quantization bias as key bottlenecks is insightful, and the KLT-guided SVD approach creatively aligns weight decomposition with input activation statistics.
2. The technical approach is sound with theoretical justifications provided in the appendices. The experimental evaluation is comprehensive, covering multiple MoE models with thorou
1. The paper mentions "negligible" computational overhead but provides limited quantitative analysis. How long does the KLT-SVD calibration take compared to standard VQ? What is the actual inference-time cost of the channel-wise bias correction operations? These practical considerations matter for deployment. Could you provide some results on this, such as the time cost of the quantization method?
2. The baseline methods, especially MoEQuant, show surprisingly poor performance at 2-bit in Table 1 (e.g., W2
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
