SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization
Yeonsik Park, Hyeonseong Kim, Seungkyu Choi

TL;DR
SERQ introduces a saliency-aware low-rank error reconstruction technique for low-bit LLM quantization, significantly improving accuracy and efficiency in W4A4 and W4A8 settings with minimal latency overhead.
Contribution
It proposes a novel saliency-aware low-rank error reconstruction method that enhances low-bit LLM inference accuracy while maintaining computational efficiency.
Findings
Outperforms prior error reconstruction methods in W4A8 and W4A4 settings.
Achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches.
Reduces calibration complexity significantly.
Abstract
Post-training quantization (PTQ) has emerged as a prevailing technique for deploying large language models (LLMs) efficiently in terms of both memory and computation, across edge devices and server platforms. Existing PTQ methods primarily aim to reduce precision in weights and activations by mitigating quantization errors caused by channel-wise outlier activations (e.g., pre-quantization scaling, online transformations, or low-rank error reconstruction). Among these approaches, error reconstruction with low-rank adaptation (LoRA) has proven particularly effective, as it introduces a lightweight auxiliary computation path without requiring heavy optimization or additional online layers. However, prior studies reveal severe accuracy degradation under W4A4 settings, and conventional low-rank adaptations rely on two sequential factors, necessitating intermediate quantization during…
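To make the baseline idea in the abstract concrete, here is a minimal sketch of generic two-factor low-rank error reconstruction. It is not SERQ itself, whose details are cut off above. It assumes symmetric per-output-channel 4-bit weight quantization and a truncated SVD of the weight error. The function names (`quantize_symmetric`, `low_rank_factors`), the matrix sizes, the rank, and the synthetic outlier channels are all illustrative assumptions, and activations stay in full precision, so the intermediate-quantization issue the abstract mentions is not modeled.

```python
import numpy as np

def quantize_symmetric(w, bits=4):
    """Fake-quantize each row of w to signed `bits`-bit integers (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # dequantized weights

def low_rank_factors(err, rank):
    """Best rank-`rank` approximation of err, returned as two factors A @ B."""
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]                       # shape (out, rank)
    b = vt[:rank, :]                                 # shape (rank, in)
    return a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))                       # hypothetical layer weight
w[:, :4] *= 20.0                                     # synthetic outlier input channels

w_q = quantize_symmetric(w, bits=4)
a, b = low_rank_factors(w - w_q, rank=8)

# Main path uses quantized weights; the auxiliary path applies the two
# factors sequentially, which is the "two sequential factors" structure.
x = rng.normal(size=(128, 16))
y_main = w_q @ x
y_aux = a @ (b @ x)
y_out = y_main + y_aux

err_quant_only = np.linalg.norm(w - w_q)
err_with_low_rank = np.linalg.norm(w - (w_q + a @ b))
print(err_quant_only, err_with_low_rank)
```

In this sketch, the weight-space error with the low-rank path cannot exceed the quantization-only error, because a truncated SVD gives the best rank-limited approximation of the error matrix in the Frobenius norm.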
Peer Reviews
Decision: ICLR 2026 Poster
* Well-written with good presentation
* Systematic evaluation and good results

* Missing related work
* Missing theoretical justification
- The motivation is clear, and the method is straightforward but effective.
- The evaluation includes both model performance and hardware results.

- The evaluation lacks models larger than 13B: all results were collected on models between 1B and 13B. Results on larger models, such as 70B, would make the scalability of SERQ more convincing.
- Details of runtime/hardware performance need further clarification. Could the authors elaborate on how a SERQ layer is accelerated on GPU? Are there kernel fusions?
● Novel Design for Latency Reduction: The paper proposes a scheme that combines a single compensation matrix with an offline permutation, with the goal of eliminating the latency overhead found in conventional two-factor error correction methods.
● Presents Experimental Results on Multiple Models: The paper reports experimental results across several modern LLMs and includes latency measurements on recent hardware to support its efficiency claims.
● Addresses a Significant Problem: The work targ

While the paper presents some interesting ideas, it suffers from significant flaws in its theoretical grounding, experimental validation, and practical considerations, which severely undermine the credibility and value of its contributions.
1. Critically Flawed and Incomplete Experimental Validation
● Flawed Experimental Comparisons: The paper's main results are built on unfair comparisons. For instance, SERQ is allocated a higher bit-budget (4.37 bits) than its main competitors (~4 bits), which
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques
