When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment
Long Zhang, Wei-neng Chen, Feng-feng Wei, Zi-bo Qin

TL;DR
This paper introduces a finite-answer stabilization framework to determine when language models' answer preferences become stable, revealing insights into their reasoning process and answer onset timing.
Contribution
It develops a computable method to identify answer stabilization in language models without relying on rollouts or learned probes, enabling precise analysis of their decision timing.
Findings
The finite-answer projection stabilizes before the answer is parseable in controlled delayed-verdict tasks.
The signal tracks the model's eventual output rather than the truth.
The signal is linearly recoverable from hidden summaries and transfers across contexts.
Abstract
Language models often generate reasoning before giving a final answer, but the visible answer does not reveal when the model's answer preference became stable. We study this question through a narrow computable object: finite-answer preference stabilization. For a model state and specified answer verbalizers, we project the model's own continuation probabilities onto a finite answer set; in binary tasks this yields an exact log-odds code. This target defines parser-based answer onset, retrospective stabilization time, and lead without relying on greedy rollouts or learned probes. In controlled delayed-verdict tasks with Qwen3-4B-Instruct, the contextual finite-answer projection stabilizes before the answer is parseable, with a mean lead of 17 to 31 tokens in the main templates and positive, shorter lead in a…
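To make the quantities named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. The function names, the sign-based stabilization rule, and the optional `margin` parameter are all assumptions made for illustration; the abstract does not give the paper's exact definitions. The sketch assumes we already have, at each token position, the model's log-probability for each answer verbalizer.

```python
import math

def project(logps):
    """Renormalize verbalizer log-probabilities over the finite answer set."""
    m = max(logps)
    z = m + math.log(sum(math.exp(x - m) for x in logps))
    return [math.exp(x - z) for x in logps]

def binary_log_odds(logp_a, logp_b):
    """Log-odds of answer A vs. B; the normalizer cancels in the binary case."""
    return logp_a - logp_b

def stabilization_time(codes, margin=0.0):
    """Assumed rule: earliest position from which every later log-odds value
    has the final sign and exceeds `margin` in magnitude.
    Returns len(codes) if even the last position fails the test."""
    final_sign = 1 if codes[-1] > 0 else -1
    t = len(codes)
    for i in range(len(codes) - 1, -1, -1):
        if codes[i] * final_sign > margin:
            t = i
        else:
            break
    return t

def lead(codes, onset, margin=0.0):
    """Tokens between stabilization and the parser-based answer onset."""
    return onset - stabilization_time(codes, margin)

# Hypothetical per-position log-odds trace; the answer becomes parseable at index 6.
codes = [0.2, -0.5, 0.1, 1.3, 2.0, 2.4, 3.1]
print(stabilization_time(codes), lead(codes, onset=6))
```

The stabilization time here is retrospective in the sense that it can only be computed once the whole trace is known, since it depends on the sign of the final value. A positive lead means the projected preference settled before the answer could be parsed from the generated text.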
