Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning
Zhuoyuan Hao, Zhuo Li, Wu Li, Fangming Liu, Min Zhang, Jing Li

TL;DR
This paper investigates the spontaneous repetition in large reasoning models, formalizes its probabilistic cost, and develops methods to harness this echo phenomenon to improve reasoning accuracy.
Contribution
It introduces a probabilistic framework for understanding model echoes and proposes finetuning and prompting techniques to exploit this for better reasoning performance.
Findings
EOP increases answer-to-prefix attention in middle layers
Methods show consistent gains on multiple reasoning benchmarks
Formalizes echo removal as rejection-based conditioning
Abstract
Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self-consistency and parallel thinking, adding generic "thinking tokens" and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain, and often ignore, the spontaneous repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the Echo of Prompt (EOP), as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the Echo Likelihood Gap…
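To make "rejection-based conditioning" concrete, the following is a minimal generic sketch and not the paper's method. The abstract is truncated before it defines the Echo Likelihood Gap, so that quantity is not reproduced here. The toy sampler, the `P_ECHO` value, and all function names are hypothetical. The sketch relies only on the standard identity that conditioning on an event A by rejection gives p(x | A) = p(x) / P(A) for x in A, so each surviving sample's log-probability shifts by -log P(A).

```python
import math
import random

# Hypothetical toy setup: P_ECHO is an assumed probability that a sampled
# trace opens with an echo. It is not a number from the paper.
P_ECHO = 0.7


def sample_trace(rng):
    """Draw one toy trace; only the echo flag is modeled."""
    return {"echo": rng.random() < P_ECHO}


def rejection_condition(sampler, accept, n_draws, rng):
    """Keep draws that satisfy `accept`; return them with the acceptance rate."""
    kept = [x for x in (sampler(rng) for _ in range(n_draws)) if accept(x)]
    return kept, len(kept) / n_draws


rng = random.Random(0)
kept, rate = rejection_condition(
    sample_trace, lambda t: not t["echo"], 20_000, rng
)

# Conditioning on the accepted event A rescales each surviving trace's
# probability by 1 / P(A), so its log-probability shifts by -log P(A).
cost_nats = -math.log(rate)
expected_draws_per_accept = 1.0 / rate
```

In this toy setting, the rarer echo-free traces are under the sampler, the larger `cost_nats` becomes and the more draws are needed per accepted trace.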
Peer Reviews
Decision·ICLR 2026 Poster
Simple, original, and well-motivated idea. The latter half of the paper contains reasonably strong evidence suggesting their claims are true. By showing a strong improvement on SFT with EOP vs weak improvement without, I was convinced that EOP is mechanistically important to higher performance in RMs.
I am having trouble making sense of the claims within p3-4. Table 1 contains a lot of information that isn’t really explained. What is the N for each “group”? Are the “correct” and “wrong” the number of samples where the answer is correct in both cases, and in some it contains the EOP and in others it doesn’t? Do the same questions have samples in both classes? What does it mean for a specific raw trace to have a single echo-trimmed counterpart? Are they the same question? Further, how significa
- The paper is well-written and well-motivated. It starts from the phenomenon that “restate the question would help answer” and introduces their study methods and experiments solidly.
- The idea of using Likelihood Gap is inspiring and interesting.
- The attention-based analysis of the Echo Prompt’s effects is well-motivated and insightful.
- Two types of experiments to demonstrate the effects of EOP are promising and comprehensive.
- As the author said in Lines 193-197, it seems a contradictory result. The “suffix-only gap” is actually larger for the wrong group (1.29 > 1.14), which contradicts the authors’ claim that EOP improves the correct group. They describe it as “the same pattern,” but the data show the opposite trend. Additionally, the authors should add the definition of “suffix-only gap” in the main paper.
- Could you use experiments to prove that there is no “absolute weight value fluctuation” issue across differ
1. The paper focuses on an understudied and not well-understood phenomenon in LLM reasoning. It asks how redundancy in the reasoning traces could actually be helpful to the model reasoning.
2. The analysis framework is reasonable, and the results suggest some correlation between EOP and reasoning correctness. The analysis is deep and insightful. Careful ablations such as on prefix length, attention-layer grouping support the results.
3. I find how the authors took their findings and used them
1. Causality remains speculative: The correlation between echoes and accuracy is solid, but the paper doesn’t prove causality. It’s perfectly possible that correct traces happen to include EOPs because the model is already more deliberate.
2. Some of the conclusions are not fully justified: I am not super convinced that the answer-to-answer-attention gap shown in Fig. 3 left is purely a product of EOPs. The authors should show the same analysis on traces without EOPs.
3. The finetuning setup
Taxonomy
Topics: Topic Modeling · Model-Driven Software Engineering Techniques · Machine Learning and Algorithms
