The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning
Henry Han, Xiyang Liu, Xiaodong Wang, Fei Han, Xiaodong Li

TL;DR
This paper shows that reducing numerical precision in multi-hop reasoning can paradoxically increase energy consumption and decrease accuracy due to hardware and latency bottlenecks, breaking traditional scaling laws.
Contribution
It reveals the 'quantization trap' in multi-hop reasoning, providing a theoretical decomposition and a predictive model for when scaling laws fail.
Findings
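In multi-hop reasoning, reducing precision from 16-bit to 8/4-bit increases net energy consumption and degrades reasoning accuracy.
The failure is attributed to hardware casting overhead (the hidden latency cost of dequantization kernels), which dominates in sequential reasoning chains, and to a sequential energy amortization failure.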
A Critical Model Scale predicts when the quantization trap occurs across hardware and model sizes.
Abstract
Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile. In this paper, we demonstrate that this scaling law breaks in the context of multi-hop reasoning. We reveal a 'quantization trap' where reducing precision from 16-bit to 8/4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead, the hidden latency cost of dequantization kernels, which becomes a dominant bottleneck in sequential reasoning chains, as well as to a sequential energy amortization failure. As a result, scaling law breaking is unavoidable in practice. We formalize a Critical Model Scale that predicts when the trap dissolves or deepens as a…
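The intuition behind the trap can be sketched with a toy energy model. The sketch below is an illustration only, not the paper's decomposition: the functional form (weight traffic proportional to bit width, plus a fixed casting overhead paid on every hop below 16-bit), the function names, and all constants are assumptions in arbitrary units. Under these assumptions, quantization saves energy only when the model is large enough for the traffic savings to exceed the per-hop overhead, and the overhead is never amortized across hops.

```python
# Toy model, all values hypothetical and in arbitrary energy units.
# Assumption: energy per hop = weight traffic + fixed casting overhead.

def hop_energy(n_params, bits, e_byte=1.0, cast_overhead=4.0e9):
    """Energy for one reasoning hop at the given bit width."""
    weight_traffic = n_params * bits / 8 * e_byte
    # Assumed: dequantization cost is paid only below 16-bit.
    overhead = cast_overhead if bits < 16 else 0.0
    return weight_traffic + overhead

def chain_energy(n_params, bits, hops, **kw):
    """Total energy for a sequential chain; overhead recurs every hop."""
    return hops * hop_energy(n_params, bits, **kw)

def critical_scale(bits, e_byte=1.0, cast_overhead=4.0e9):
    """Parameter count where quantized and 16-bit hop energy are equal."""
    return 8 * cast_overhead / ((16 - bits) * e_byte)

if __name__ == "__main__":
    for n in (1e9, 7e9):
        e16 = chain_energy(n, 16, hops=5)
        e8 = chain_energy(n, 8, hops=5)
        verdict = "trap" if e8 > e16 else "saves energy"
        print(f"{n:.0e} params: 16-bit={e16:.2e}, 8-bit={e8:.2e} ({verdict})")
    print(f"toy critical scale at 8-bit: {critical_scale(8):.2e} params")
```

In this toy setting the break-even point depends only on the ratio of the assumed overhead to the assumed per-byte energy, which is why a single threshold on model size separates the two regimes. Whether the real Critical Model Scale behaves this way depends on the paper's full formalization.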
