SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization

Yeonsik Park; Hyeonseong Kim; Seungkyu Choi

arXiv:2603.08185·cs.LG·March 10, 2026

Open Access · 3 Reviews

TL;DR

SERQ introduces a saliency-aware low-rank error reconstruction technique for low-bit LLM quantization, significantly improving accuracy and efficiency in W4A4 and W4A8 settings with minimal latency overhead.

Contribution

It proposes a novel saliency-aware low-rank error reconstruction method that enhances low-bit LLM inference accuracy while maintaining computational efficiency.

Findings

01

Outperforms prior error reconstruction methods in W4A8 and W4A4 settings.

02

Achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches.

03

Reduces calibration complexity significantly.

Abstract

Post-training quantization (PTQ) has emerged as a prevailing technique for deploying large language models (LLMs) efficiently in terms of both memory and computation, across edge devices and server platforms. Existing PTQ methods primarily aim to reduce precision in weights and activations by mitigating quantization errors caused by channel-wise outlier activations (e.g., pre-quantization scaling, online transformations, or low-rank error reconstruction). Among these approaches, error reconstruction with low-rank adaptation (LoRA) has proven particularly effective, as it introduces a lightweight auxiliary computation path without requiring heavy optimization or additional online layers. However, prior studies reveal severe accuracy degradation under W4A4 settings, and conventional low-rank adaptations rely on two sequential factors, necessitating intermediate quantization during…
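The low-rank error reconstruction idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is a generic two-factor baseline built from a truncated SVD of the weight quantization error, not SERQ itself; the function names, the per-row symmetric 4-bit quantizer, the matrix sizes, and the rank of 8 are all assumptions chosen for illustration. Activations are kept in floating point here, so the intermediate quantization between the two factors that the abstract refers to is not modelled.

```python
import numpy as np

def quantize_symmetric(w, bits=4):
    """Per-row symmetric uniform quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def low_rank_error_factors(w, w_q, rank):
    """Best rank-`rank` approximation of the error W - Q(W) as two factors."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]                      # shape (out, rank)
    b = vt[:rank, :]                                # shape (rank, in)
    return a, b

# Hypothetical layer: 64 output features, 128 input features.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
x = rng.normal(size=(16, 128))                      # a small batch of activations

w_q = quantize_symmetric(w, bits=4)
a, b = low_rank_error_factors(w, w_q, rank=8)

# Main quantized path plus the auxiliary path through two sequential factors.
y_q = x @ w_q.T
y_c = y_q + (x @ b.T) @ a.T

err_q = np.linalg.norm(w - w_q)                     # error without correction
err_c = np.linalg.norm(w - (w_q + a @ b))           # error with correction
print(f"weight error: quantized={err_q:.4f}, corrected={err_c:.4f}")
```

The auxiliary path multiplies by `b` and then by `a`, which is the two-sequential-factor structure the abstract attributes to conventional low-rank adaptations. In weight space the corrected error cannot exceed the uncorrected one, because a truncated SVD is the best low-rank approximation in the Frobenius norm.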

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 2

Strengths

* Well-written with good presentation
* Systematic evaluation and good results

Weaknesses

* Missing related work
* Missing theoretical justification

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The motivation is clear and the method is straightforward but effective
- The evaluation includes both model performance and hardware results

Weaknesses

- The evaluation lacks models larger than 13B: all results were collected on models between 1B and 13B. Results on larger models, such as 70B, would make the scalability of SERQ more convincing.
- Details of the runtime/hardware performance need further clarification. Could the authors elaborate on how a SERQ layer is accelerated on GPU? Are there kernel fusions?

Reviewer 03 · Rating 4 · Confidence 4

Strengths

● Novel Design for Latency Reduction: The paper proposes a scheme that combines a single compensation matrix with an offline permutation, with the goal of eliminating the latency overhead found in conventional two-factor error correction methods.
● Presents Experimental Results on Multiple Models: The paper reports experimental results across several modern LLMs and includes latency measurements on recent hardware to support its efficiency claims.
● Addresses a Significant Problem: The work targ

Weaknesses

While the paper presents some interesting ideas, it suffers from significant flaws in its theoretical grounding, experimental validation, and practical considerations, which severely undermine the credibility and value of its contributions.

1. Critically Flawed and Incomplete Experimental Validation

● Flawed Experimental Comparisons: The paper's main results are built on unfair comparisons. For instance, SERQ is allocated a higher bit-budget (4.37 bits) than its main competitors (~4 bits), which

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques