Lost at the Beginning of Reasoning
Baohao Liao, Xinyi Chen, Sara Rajaee, Yuhui Xu, Christian Herold, Anders Søgaard, Maarten de Rijke, Christof Monz

TL;DR
This paper demonstrates that the initial reasoning step in large language models heavily influences overall accuracy, and proposes a cost-effective sampling method to improve reasoning quality by focusing on the first step.
Contribution
It reveals the critical impact of the first reasoning step on model predictions and introduces an efficient sampling strategy to enhance reasoning quality and reduce inference costs.
Findings
Errors at the first reasoning step significantly degrade final predictions
The proposed sampling strategy reduces inference cost by up to 70%
High-quality initial reasoning steps lead to better overall reasoning outcomes
Abstract
Recent advances in large language models (LLMs) have significantly improved complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection, and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored, and recent findings on overthinking suggest that such models often engage in unnecessarily redundant reasoning. In this work, we empirically show that the first reasoning step exerts a disproportionately large influence on the final prediction: errors introduced at this stage can substantially degrade subsequent reasoning quality. This phenomenon is consistently observed across various state-of-the-art open- and closed-source reasoning models. Leveraging this insight, we propose an efficient sampling strategy…
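The abstract is cut off before it describes the sampling strategy, but the reviews summarize it as early pruning via PRM-scored first steps. The following is a minimal sketch of that general idea, not the authors' implementation. The function `prune_by_first_step`, its parameters, and the three callables (`sample_first_step`, `score_first_step`, `continue_from`) are all hypothetical names introduced here for illustration; in practice they would wrap an LLM and a process reward model.

```python
from typing import Callable, List, Tuple


def prune_by_first_step(
    question: str,
    sample_first_step: Callable[[str], str],        # hypothetical: draws one candidate first step
    score_first_step: Callable[[str, str], float],  # hypothetical: PRM-style score for a first step
    continue_from: Callable[[str, str], str],       # hypothetical: completes a trace from a first step
    n_candidates: int = 8,
    n_keep: int = 2,
) -> List[Tuple[float, str]]:
    """Sample several first steps, keep the best-scored ones, and only
    spend full-generation compute on those survivors."""
    if not 1 <= n_keep <= n_candidates:
        raise ValueError("n_keep must be between 1 and n_candidates")

    # Cheap stage: generate only the first step of each candidate trace.
    first_steps = [sample_first_step(question) for _ in range(n_candidates)]

    # Score every first step and sort from best to worst.
    scored = sorted(
        ((score_first_step(question, step), step) for step in first_steps),
        key=lambda pair: pair[0],
        reverse=True,
    )

    # Expensive stage: continue only the top-scoring first steps.
    survivors = scored[:n_keep]
    return [(score, continue_from(question, step)) for score, step in survivors]


def toy_usage() -> List[Tuple[float, str]]:
    """Deterministic stand-ins so the sketch runs without any model."""
    candidates = iter(["step A", "step BB", "step CCC", "step D"])
    return prune_by_first_step(
        question="toy question",
        sample_first_step=lambda q: next(candidates),
        score_first_step=lambda q, step: float(len(step)),  # toy scorer: longer is "better"
        continue_from=lambda q, step: step + " -> full trace",
        n_candidates=4,
        n_keep=2,
    )
```

In this sketch the saving comes from calling `continue_from` only `n_keep` times instead of `n_candidates` times. The actual saving depends on how long the first step is relative to the full trace, which the text above does not specify.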
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper is well-written, and the idea is interesting. Specifically, it shows that first-step quality heavily correlates with final correctness, and even small perturbations to an otherwise correct first step can yield large accuracy drops. Also, this paper proposes a simple and effective method by early pruning via PRM-scored first steps. Besides, the evaluation covers a lot of models, enhancing the robustness of the evaluation, as well as useful ablation studies. Lastly, the paper uncovers cl
**Limited evaluation datasets for Section 3**: While testing multiple LLMs, the only datasets used for Section 3 are AIME 24 and 25, which only include 60 questions in total. This raises concerns about the robustness of the analysis, given that it is the main finding of this paper. An extended evaluation on other reasoning datasets could make the results more generalizable. **Pruning potentially contributive segments**: By pruning early, we may discard rare but ultimately correct traces that re
The decreasing pattern in the similarity score between each intermediate reasoning step and the final reasoning step is interesting. The proposed method to improve the accuracy of the first reasoning step looks reasonable.
The definition of reasoning step is vague. This paper vaguely states that a reasoning step is a complete logical leap or self-contained unit. Precisely, what is a reasoning step? The reasoning step segmentation method is questionable. First, the segmentation accuracy is not shown in this paper. This leads to a critical technical flaw. More specifically, if the segmentation accuracy is low, should we trust the observation in this paper? Also, it is possible that the segmented last step does not
1. The paper provides a clean and reproducible empirical observation that the first reasoning step strongly correlates with the final answer. 2. The proposed pruning strategy is simple yet effective in improving the original LLM's reasoning performance. 3. Extensive evaluation is conducted on diverse datasets. 4. The writing is mostly clear, figures are illustrative and the experiments are easy to follow.
1. Limited novelty and conceptual depth. The main claim (“the first step matters most”) is quite trivial and incremental, especially considering the previous works (e.g. Lost in the Middle, and other efficient CoT and overthinking works) which study the importance of reasoning steps. This claim is only supported by the semantic similarity rather than a causality analysis. Besides, the actual reason why the first step shows more similarity is not explored, because it is possible that both the fir
1. Provides empirical evidence on the link between early reasoning and final correctness 2. The idea and presentation are clear: definitions are stated, the pipeline is easy to follow, main figures/tables support the claims, showing the effectiveness of their method.
1. Ambiguous “beginning of reasoning.” §3.1/§3.2 define the first step via two different segmentations, while §3.3 is actually a prefix intervention (front-loading the conclusion segment). Main experiments also approximate “first step” by a fixed token length. These heterogeneous settings weaken the causal claims. 2. Lack of position analysis for errors. Current evidence does not establish that only beginning errors are disproportionately harmful (as the abstract suggests). Mid- or last-step er
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
