BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
Sai Babu Patarlapalli, Surya Teja Avvaru

TL;DR
BitCal-TTS is a runtime controller for quantized reasoning models that improves accuracy and reduces premature stopping without fine-tuning, using online uncertainty proxies and bit-aware confidence rescaling.
Contribution
The paper introduces a lightweight method for adaptive test-time scaling in 4-bit quantized models that needs no fine-tuning and improves reasoning accuracy and efficiency.
Findings
Improves exact-match accuracy on GSM8K evaluation shards at 7B and 14B scales.
Reduces the premature-stopping rate from 14.8% to 11.1% on 7B models, a drop of 3.7 percentage points.
Maintains substantial token savings compared to fixed-budget decoding.
Abstract
Post-training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test-time compute allocation. Under a fixed cap on the number of newly generated tokens, miscalibrated confidence can lead to harmful early halting: the model may surface a plausible final line while the underlying reasoning is still wrong, or the controller may stop before the trace has stabilized. We study this interaction for greedy 4-bit inference and propose BitCal-TTS, a lightweight runtime controller that combines (i) inexpensive online proxies for token-level uncertainty and reasoning-trace stability, (ii) a bit-conditioned confidence rescaling that is conservative at low nominal precision, and (iii) a bit-aware post-marker confirmation horizon designed for GSM8K-style structured outputs. The method requires no…
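To make components (i) through (iii) of the abstract concrete, here is a minimal Python sketch of a halting controller. It is not the authors' implementation: the abstract gives no formulas, so the entropy-based confidence proxy, the power-law shrink in `rescale_for_bits`, the horizon rule in `confirmation_horizon`, the `HaltController` class, and every default value (`alpha`, `base`, `threshold`, `ref_bits`) are assumptions chosen for illustration. The reasoning-trace stability proxy is left out entirely. The only detail taken from the task format is the GSM8K final-answer marker `####`.

```python
import math
from dataclasses import dataclass, field

ANSWER_MARKER = "####"  # GSM8K final-answer marker


def confidence_from_probs(probs):
    """Assumed uncertainty proxy: 1 - normalized Shannon entropy of the
    next-token distribution (1.0 = fully peaked, 0.0 = uniform)."""
    n = len(probs)
    if n < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(n)


def rescale_for_bits(conf, bits, ref_bits=16, alpha=0.5):
    """Assumed bit-conditioned rescaling: shrink confidence more as the
    nominal precision drops below ref_bits (hypothetical power law)."""
    shrink = min(1.0, bits / ref_bits) ** alpha
    return conf * shrink


def confirmation_horizon(bits, base=2, ref_bits=16):
    """Assumed bit-aware horizon: number of tokens to keep decoding after
    the answer marker; longer at lower precision (hypothetical rule)."""
    return base * max(1, ref_bits // bits)


@dataclass
class HaltController:
    """Hypothetical controller: stop at the token cap, or once the answer
    marker has been followed by a full confirmation horizon whose mean
    rescaled confidence clears the threshold."""
    bits: int
    max_new_tokens: int
    threshold: float = 0.4
    seen: int = 0
    since_marker: int = -1  # -1 means the marker has not appeared yet
    confs: list = field(default_factory=list)

    def step(self, token, probs):
        """Consume one generated token; return True if decoding should stop."""
        self.seen += 1
        conf = rescale_for_bits(confidence_from_probs(probs), self.bits)
        if self.since_marker >= 0:
            # Already past the marker: count this token toward confirmation.
            self.since_marker += 1
            self.confs.append(conf)
        elif ANSWER_MARKER in token:
            # Marker just appeared: start the confirmation window.
            self.since_marker = 0
            self.confs = []
        if self.seen >= self.max_new_tokens:
            return True  # fixed cap on newly generated tokens
        if self.since_marker >= confirmation_horizon(self.bits):
            return sum(self.confs) / len(self.confs) >= self.threshold
        return False
```

Under these assumed defaults, a 4-bit model waits 8 tokens after the marker and has its confidence halved, while a 16-bit model waits 2 tokens with no shrink. In a real decoding loop, `step` would be called once per generated token with that token's text and its next-token probabilities.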
