LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss
Euntae Choi, Sumin Song, Sungjoo Yoo

TL;DR
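LAQuant is a layer-wise, weight-only quantization method for large reasoning models. It improves accuracy and decoding speed by targeting the two factors the authors identify behind accuracy loss on long-decoding benchmarks, KV-cache fidelity and Hessian-subspace alignment between calibration and deployment data, without adding online-transform overhead.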
Contribution
LAQuant is a layer-wise quantization approach that improves reasoning-model performance by combining domain calibration with a lookahead loss, addressing KV-cache fidelity and the Hessian-subspace alignment between calibration data and the deployment distribution.
Findings
LAQuant improves AIME25 Pass@1 by 15.11 percentage points (pp) over ParoQuant.
LAQuant achieves a 3.42x decoding speedup over FP16.
LAQuant outperforms existing quantization methods on long-decoding reasoning benchmarks.
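For readers unfamiliar with the metric, here is a minimal sketch of how a Pass@1 gap in percentage points is commonly computed: Pass@1 is the fraction of sampled answers that are correct, averaged over problems. The function name `pass_at_1` and the toy results are hypothetical illustrations, not the paper's evaluation code or data.

```python
# Illustrative sketch only: toy data, not results from the paper.

def pass_at_1(per_problem_results):
    """Average, over problems, of the fraction of correct samples."""
    fractions = [sum(samples) / len(samples) for samples in per_problem_results]
    return sum(fractions) / len(fractions)

# Each inner list holds correctness flags for the samples drawn on one problem.
baseline = [[True, False, False, False], [False, False, False, False]]
candidate = [[True, True, False, False], [True, False, False, False]]

# A gap in percentage points is the plain difference of the two percentages.
gap_pp = 100 * (pass_at_1(candidate) - pass_at_1(baseline))
print(f"Pass@1 gap: {gap_pp:.2f} pp")
```

A percentage-point gap is an absolute difference between two accuracies, not a relative improvement.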
Abstract
Large reasoning models (LRMs) reach competition-level math and coding accuracy via long autoregressive decoding, making per-token decoding cost a primary deployment concern. Weight quantization is the standard tool for acceleration, but representative recipes -- including state-of-the-art end-to-end (E2E) QAT -- lose accuracy on long-decoding reasoning benchmarks despite preserving perplexity and short-decode accuracy. Through a systematic gradient-direction analysis, we identify two factors driving this gap: (i) KV-cache fidelity preservation under the QAT loss, which E2E supervision attenuates via the softmax Fisher metric; and (ii) Hessian-subspace alignment between calibration data and the deployment distribution. We propose LookAhead Quantization (LAQuant), a layer-wise weight-only QAT method that addresses both factors without online-transform overhead by combining…
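As background for the terms used in the abstract, the sketch below shows generic weight-only quantization (round-to-nearest on small weight groups) and a plain layer-wise output-reconstruction error measured on calibration inputs. It is an assumption-laden illustration of the general setting only: it does not implement LAQuant's lookahead loss, which this summary does not define, and all function names, the group size, and the toy tensors are hypothetical.

```python
# Illustrative sketch only: generic round-to-nearest weight-only quantization.
# This is NOT LAQuant's method or loss; names and values are hypothetical.

def quantize_group(weights, bits=4):
    """Asymmetric uniform quantization of one group of weights."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant

def quantize_row(row, bits=4, group_size=4):
    """Quantize a weight row group by group and return dequantized values."""
    out = []
    for start in range(0, len(row), group_size):
        _, deq = quantize_group(row[start:start + group_size], bits)
        out.extend(deq)
    return out

def linear(weight, x):
    """Matrix-vector product for a weight stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def layer_output_mse(weight, weight_q, calib_inputs):
    """Mean squared error between full-precision and quantized layer outputs."""
    total, count = 0.0, 0
    for x in calib_inputs:
        y, yq = linear(weight, x), linear(weight_q, x)
        total += sum((a - b) ** 2 for a, b in zip(y, yq))
        count += len(y)
    return total / count

# Toy layer and calibration inputs (made up for illustration).
weight = [[0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.19],
          [-0.30, 0.27, -0.11, 0.44, 0.02, -0.36, 0.25, -0.09]]
calib_inputs = [[1.0, -0.5, 0.25, 0.0, 0.75, -1.0, 0.5, 0.1],
                [0.2, 0.4, -0.6, 0.8, -0.3, 0.1, 0.9, -0.7]]

weight_q4 = [quantize_row(row, bits=4) for row in weight]
err4 = layer_output_mse(weight, weight_q4, calib_inputs)
print(f"4-bit layer output MSE on calibration inputs: {err4:.6f}")
```

Because the error is measured on calibration inputs, the choice of calibration data shapes which weight directions are preserved, which is one reason the abstract stresses alignment between calibration data and the deployment distribution.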
