MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization
Le Su, Xing Luo, Zhi Jin

TL;DR
This paper introduces MARR, a module-adaptive residual reconstruction method for low-bit post-training quantization (PTQ) that balances accumulated-error correction against residual-induced bias on a per-module basis to improve quantization accuracy.
Contribution
The paper proposes a novel module-specific residual scaling approach with an adaptive PID strategy to enhance quantization accuracy across different modules.
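This summary does not say how the PID strategy is applied, so the following is only a generic sketch of a textbook discrete PID update, assuming it adjusts a per-module residual scale `alpha` in response to some error signal. The class `PIDScale`, the gains, the sign convention, and the clipping range [0, 1] are all hypothetical and are not taken from the paper.

```python
class PIDScale:
    """Textbook discrete PID controller driving a scalar alpha (illustrative only)."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.05, alpha=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd  # hypothetical gains
        self.alpha = alpha                      # current residual scale
        self.integral = 0.0                     # running sum of errors
        self.prev_error = 0.0                   # error from the previous step

    def step(self, error):
        # `error` is a hypothetical signal, e.g. measured bias minus tolerated bias.
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        update = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Assumed convention: a positive error shrinks alpha; clip to [0, 1].
        self.alpha = min(1.0, max(0.0, self.alpha - update))
        return self.alpha


# One controller per module, so each module settles on its own scale.
controllers = {name: PIDScale() for name in ("attn.qkv", "mlp.fc1")}
```

Keeping a separate controller per module is what makes the scale module-specific in this sketch; a single shared controller would reduce to one global residual strength.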
Findings
Achieves up to 20.2% performance gains on LLMs.
Achieves up to 4.6% relative gains on ViTs.
Demonstrates effectiveness under 4-bit quantization.
Abstract
Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce the error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, our analysis shows that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, so a single global residual strength cannot balance effective correction and residual-related bias across modules. Based on these observations, we propose Module-Adaptive…
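The role of the scaling coefficient can be pictured with a toy example. The sketch below is an illustration under stated assumptions, not the paper's implementation: it treats the cross-layer residual as the difference between the full-precision and quantized-path activations entering a module, and the names `scaled_residual_input`, `x_fp`, `x_q`, `alpha`, and the module names are hypothetical.

```python
import numpy as np


def scaled_residual_input(x_fp, x_q, alpha):
    """Add a scaled residual to the quantized-path activation.

    Assumption (not from the paper): the residual is x_fp - x_q, the error
    accumulated by earlier quantized layers. alpha = 0 ignores the residual,
    alpha = 1 applies it in full.
    """
    residual = x_fp - x_q
    return x_q + alpha * residual


rng = np.random.default_rng(0)
x_fp = rng.normal(size=(4, 8))              # full-precision activations
x_q = x_fp + 0.1 * rng.normal(size=(4, 8))  # activations after quantized layers

# Hypothetical per-module coefficients: one value per module, not one global value.
module_alpha = {"attn.qkv": 0.8, "mlp.fc1": 0.4}
inputs = {
    name: scaled_residual_input(x_fp, x_q, a) for name, a in module_alpha.items()
}
```

With a single global coefficient every module would receive the same correction strength; the dictionary of per-module values is the simplest way to express the module-dependent trade-off the abstract describes.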
