LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

Euntae Choi; Sumin Song; Sungjoo Yoo

arXiv:2605.08755·cs.LG·May 12, 2026

TL;DR

LAQuant is a layer-wise, weight-only QAT method that targets the accuracy that quantized reasoning models lose on long-decoding benchmarks, while keeping the decoding speedup and adding no online-transform overhead.

Contribution

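LAQuant is a layer-wise weight-only QAT method that combines domain calibration with a lookahead loss. It targets the two factors the paper identifies behind the long-decoding accuracy gap: KV-cache fidelity preservation under the QAT loss, and Hessian-subspace alignment between calibration data and the deployment distribution.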

Findings

01

LAQuant improves AIME25 Pass@1 by 15.11 percentage points over ParoQuant.

02

LAQuant achieves a 3.42x decoding speedup over FP16.

03

LAQuant outperforms existing quantization methods on long-decoding reasoning benchmarks.

Abstract

Large reasoning models (LRMs) reach competition-level math and coding accuracy via long autoregressive decoding, making per-token decoding cost a primary deployment concern. Weight quantization is the standard tool for acceleration, but representative recipes -- including state-of-the-art end-to-end (E2E) QAT -- lose accuracy on long-decoding reasoning benchmarks despite preserving perplexity and short-decode accuracy. Through a systematic gradient-direction analysis, we identify two factors driving this gap: (i) KV-cache fidelity preservation under the QAT loss, which E2E supervision attenuates via the softmax Fisher metric; and (ii) Hessian-subspace alignment between calibration data and the deployment distribution. We propose LookAhead Quantization (LAQuant), a layer-wise weight-only QAT method that addresses both factors without online-transform overhead by combining…
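For readers unfamiliar with the setting, the following is a minimal sketch of what "layer-wise weight-only quantization" means in general. It is not LAQuant: it uses plain round-to-nearest rather than QAT, and it scores a layer by the ordinary output mean-squared error on calibration inputs, not by the lookahead loss, which the truncated abstract above does not define. All names (`quantize_row`, `layer_output_mse`, the random weights and inputs) are hypothetical and chosen only for illustration.

```python
import random


def quantize_row(weights, bits=4):
    """Asymmetric uniform round-to-nearest quantization of one weight row.

    Returns the integer codes and the dequantized (reconstructed) weights.
    Generic baseline for illustration; not the method proposed in the paper.
    """
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when the row is constant.
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    dequant = [w_min + c * scale for c in codes]
    return codes, dequant


def linear(weight_rows, x):
    """Output of a bias-free linear layer for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def layer_output_mse(weight_rows, quant_rows, calib_inputs):
    """Mean-squared error between full-precision and quantized layer outputs."""
    total, count = 0.0, 0
    for x in calib_inputs:
        y_fp = linear(weight_rows, x)
        y_q = linear(quant_rows, x)
        for a, b in zip(y_fp, y_q):
            total += (a - b) ** 2
            count += 1
    return total / count


# Hypothetical toy layer (8 outputs, 16 inputs) and 32 calibration vectors.
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
X = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]

codes4 = [quantize_row(row, bits=4)[0] for row in W]
W4 = [quantize_row(row, bits=4)[1] for row in W]
W8 = [quantize_row(row, bits=8)[1] for row in W]

err4 = layer_output_mse(W, W4, X)
err8 = layer_output_mse(W, W8, X)
print(f"4-bit layer output MSE: {err4:.6f}")
print(f"8-bit layer output MSE: {err8:.6f}")
```

Only the weights are quantized here; activations stay in full precision, which is what "weight-only" refers to. The per-layer error depends on the calibration inputs, which is one reason the choice of calibration data matters in layer-wise methods.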
