The Potential of CoT for Reasoning: A Closer Look at Trace Dynamics
Gregor Bachmann, Yichen Jiang, Seyed Mohsen Moosavi Dezfooli, Moin Nabi

TL;DR
This paper analyzes chain-of-thought (CoT) reasoning traces in large language models. It introduces the potential, a measure of how much a given part of a trace raises the likelihood of a correct final answer, and uses it to expose non-monotonic reasoning dynamics and to show that reasoning insights transfer across models.
Contribution
It introduces the notion of potential to measure reasoning impact and explores CoT transferability between models, providing new insights into reasoning dynamics in LLMs.
Findings
The potential is often non-monotonic and unpredictable
Sharp spikes in potential indicate reasoning insights or jumps
Partial CoT from a stronger model can significantly improve weaker models' performance
Abstract
Chain-of-thought (CoT) prompting is a de-facto standard technique to elicit reasoning-like responses from large language models (LLMs), allowing them to spell out individual steps before giving a final answer. While the resemblance to human-like reasoning is undeniable, the driving forces underpinning the success of CoT reasoning still remain largely unclear. In this work, we perform an in-depth analysis of CoT traces originating from competition-level mathematics questions, with the aim of better understanding how, and which parts of CoT actually contribute to the final answer. To this end, we introduce the notion of a potential, quantifying how much a given part of CoT increases the likelihood of a correct completion. Upon examination of reasoning traces through the lens of the potential, we identify surprising patterns including (1) its often strong non-monotonicity (due to reasoning…
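As a rough illustration of the idea described in the abstract, the potential of a CoT prefix can be approximated by sampling completions from that prefix and counting how many end in the correct answer. The following is a minimal sketch, not the authors' implementation: it assumes the potential is the probability of a correct final answer given the question and the CoT prefix, and the names `estimate_potential`, `potential_curve`, and `toy_sampler` are hypothetical. The default of 128 samples mirrors the $N=128$ mentioned in the reviews; `toy_sampler` is a deterministic stand-in for a real LLM call.

```python
from typing import Callable, List

# Hypothetical sampler signature: (question, cot_prefix_steps) -> final answer string.
Sampler = Callable[[str, List[str]], str]


def estimate_potential(question: str, cot_prefix: List[str], sample_answer: Sampler,
                       correct_answer: str, n_samples: int = 128) -> float:
    """Monte Carlo estimate: fraction of sampled completions that are correct."""
    hits = sum(
        sample_answer(question, cot_prefix) == correct_answer
        for _ in range(n_samples)
    )
    return hits / n_samples


def potential_curve(question: str, cot_steps: List[str], sample_answer: Sampler,
                    correct_answer: str, n_samples: int = 128) -> List[float]:
    """Potential after 0, 1, ..., len(cot_steps) steps of the trace."""
    return [
        estimate_potential(question, cot_steps[:t], sample_answer,
                           correct_answer, n_samples)
        for t in range(len(cot_steps) + 1)
    ]


def toy_sampler(question: str, cot_prefix: List[str]) -> str:
    """Stand-in for an LLM: answers correctly only once the prefix holds an 'insight'."""
    return "42" if any("insight" in step for step in cot_prefix) else "0"


steps = ["set up the problem", "key insight: use symmetry", "finish the algebra"]
curve = potential_curve("toy question", steps, toy_sampler, correct_answer="42")
print(curve)
```

With a real model, `sample_answer` would draw a stochastic completion conditioned on the prefix, so each point on the curve would carry sampling noise that shrinks as `n_samples` grows.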
Peer Reviews
Decision: ICLR 2026 Poster
- The authors characterize a new way to study which parts of the CoT are most impactful/useful in the model's reasoning via a new metric they call "potential".
- Some interesting observations are made that connect the changes in the potential with different parts/strategies in the CoT.
- The paper is largely well-written.

- The implications and future applications of the potential metric are unclear (or they seem rather limited). Additional quantitative analysis may enable us to draw more general observations about the reasoning of LLMs. The conclusion is missing a discussion on future work, which could have alleviated this concern.
- Experiments were focused on math competition examples, and it's unclear if similar observations would be drawn in other domains.
The paper addresses an important gap in understanding how CoT actually helps models reason, not just appear to. Introducing “potential” as a metric is novel and provides a quantitative way to evaluate reasoning progress. The experiments on AIME problems are well-motivated and the qualitative analysis of “reasoning tangents” and “insights” is compelling. The study also goes beyond introspection by testing CoT transferability, showing interesting empirical results that small parts of reasoning fro
The proposed concept of potential is not rigorously justified beyond empirical correlation, and its interpretation remains vague. The assumption that higher potential implies genuine reasoning progress is too strong, as sampling-based estimation might simply capture distributional artifacts. The analysis often reads descriptive rather than explanatory; the paper reports patterns (insights, jumps, tangents) without providing mechanisms or theoretical grounding. The heavy reliance on visual exampl
(1) Novel and Intuitive Methodology: The core contribution, the "potential" metric (Eq. 1), offers a principled and intuitive method for quantifying progress within a CoT. This provides a valuable tool for moving beyond simple final-answer accuracy to analyze the intermediate reasoning steps. (2) Strong Qualitative Analysis: The paper excels in its qualitative analysis, providing clear and well-annotated examples that connect the behavior of the potential curve to specific segments of the model
(1) Estimating the potential metric requires sampling a large number of completions ($N=128$) at every intermediate CoT step to obtain a stable estimate. The manuscript does not address the resulting computational overhead, which may pose a serious limitation for scalability and broader adoption. (2) The quantitative analysis in Table 1 depends on fixed, manually chosen thresholds to categorize behaviors such as “insights” (potential increase > 40%), “tangents” (potential drop > 30%), and “gues
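The fixed thresholds this review refers to can be pictured as a simple rule applied to consecutive values of the potential curve. The sketch below is an assumption-laden illustration, not the paper's procedure: it reads "increase > 40%" and "drop > 30%" as absolute changes in potential between adjacent steps, covers only the two categories whose thresholds are fully stated here, and uses the hypothetical name `label_steps`.

```python
from typing import List


def label_steps(potentials: List[float], insight_jump: float = 0.40,
                tangent_drop: float = 0.30) -> List[str]:
    """Label each step by the change in potential it causes.

    Assumes thresholds are absolute differences between consecutive
    potential values in [0, 1]; this interpretation is a guess.
    """
    labels = []
    for prev, curr in zip(potentials, potentials[1:]):
        delta = curr - prev
        if delta > insight_jump:
            labels.append("insight")
        elif delta < -tangent_drop:
            labels.append("tangent")
        else:
            labels.append("neutral")
    return labels


labels = label_steps([0.10, 0.60, 0.20, 0.25])
print(labels)
```

Because the cutoffs are plain parameters, a sensitivity check over a range of `insight_jump` and `tangent_drop` values is straightforward.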
Taxonomy
Topics: Language and cultural evolution · Multimodal Machine Learning Applications · Embodied and Extended Cognition
