Temporal Predictors of Outcome in Reasoning Language Models
Joey David

TL;DR
This paper investigates how early in the reasoning process large language models internally commit to their final outcome. Linear probes on hidden states predict eventual correctness after just a few reasoning tokens, with implications for interpretability and inference-time control.
Contribution
It demonstrates that internal signals of eventual correctness emerge early in reasoning, and shows that the drop in predictive accuracy on harder questions points to a selection artifact: hard items are disproportionately represented in long chains of thought.
Findings
Eventual correctness can be predicted after only a few reasoning tokens.
Predictive accuracy drops for harder questions.
Hard items are disproportionately represented in long reasoning chains.
Abstract
The chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning, gradually refining the model's latent representation of a solution. However, it remains unclear just how early a Large Language Model (LLM) internally commits to an eventual outcome. We probe this by training linear classifiers on hidden states after the first t reasoning tokens, showing that eventual correctness is highly predictable after only a few tokens, even when longer outputs are needed to reach a definite answer. We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact: hard items are disproportionately represented in long CoTs. Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens, with implications for interpretability and for inference-time control.
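The probing setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's code: the hidden states are synthetic Gaussian vectors produced by a hypothetical `synthetic_hidden_states` helper, the probe is a plain logistic regression fitted by gradient descent, and the mapping from token position t to signal strength is invented. In a real experiment the inputs would be hidden states extracted from the model after the first t reasoning tokens, with labels marking whether the final answer was correct.

```python
import math
import random


def synthetic_hidden_states(n, dim, signal, rng):
    """Toy stand-in for hidden states after t reasoning tokens (hypothetical data)."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = eventually correct, 0 = eventually wrong
        shift = signal if label == 1 else -signal
        # Only coordinate 0 carries the label; the other coordinates are pure noise.
        X.append([rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)])
        y.append(label)
    return X, y


def _logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(X, y, epochs=200, lr=0.1):
    """Fit a linear (logistic-regression) probe with full-batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            z = max(-30.0, min(30.0, _logit(w, b, x)))  # clamp to avoid overflow
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples where the sign of the probe logit matches the label."""
    hits = sum(int((_logit(w, b, x) > 0) == (label == 1)) for x, label in zip(X, y))
    return hits / len(X)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical (t, signal) pairs: one probe is trained per token position t.
    for t, signal in [(1, 0.2), (4, 1.0), (16, 2.0)]:
        X_train, y_train = synthetic_hidden_states(400, 8, signal, rng)
        X_test, y_test = synthetic_hidden_states(200, 8, signal, rng)
        w, b = train_probe(X_train, y_train)
        print(f"t={t:>2}  held-out probe accuracy={probe_accuracy(w, b, X_test, y_test):.2f}")
```

Training a separate probe for each t and plotting held-out accuracy against t gives the kind of curve the abstract refers to. Because longer chains of thought contain a different mix of items, comparing accuracies across t is only meaningful on a fixed set of questions.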
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Ethics and Social Impacts of AI
