Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought?
Alexander von Recum, Leander Girrbach, Zeynep Akata

TL;DR
This paper investigates how robust reasoning large language models (RLLMs) are to perturbations inserted into their own chains of thought, revealing general resilience, style-dependent effects, and a trade-off between robustness and efficiency.
Contribution
Introduces a controlled framework for perturbing RLLMs' reasoning traces, analyzing their robustness, style effects, and recovery mechanisms across multiple tasks and models.
Findings
RLLMs are generally robust to perturbations, with robustness increasing with model size.
Early interventions degrade robustness and performance.
Paraphrasing reduces doubt expressions and can harm accuracy.
Abstract
Reasoning LLMs (RLLMs) generate step-by-step chains of thought (CoTs) before giving an answer, which improves performance on complex tasks and makes reasoning more transparent. But how robust are these reasoning traces to disruptions that occur within them? To address this question, we introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps. We design seven interventions (benign, neutral, and adversarial) and apply them to multiple open-weight RLLMs across Math, Science, and Logic tasks. Our results show that RLLMs are generally robust, reliably recovering from diverse perturbations, with robustness improving with model size and degrading when interventions occur early. However, robustness is not style-invariant: paraphrasing suppresses doubt-like expressions and reduces performance, while other interventions trigger doubt and support recovery.…
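The perturbation setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sentence-level step granularity, the `<think>` tag, and the example fraction are all assumptions made for illustration. The idea is to cut a model's own CoT at a fixed relative position, append an intervention text, and hand the perturbed prefix back to the model so it can continue reasoning.

```python
from typing import List


def truncate_at_fraction(cot_steps: List[str], fraction: float) -> List[str]:
    """Keep the leading `fraction` of the reasoning steps (hypothetical helper)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    cut = int(len(cot_steps) * fraction)  # rounds down to a whole step
    return cot_steps[:cut]


def apply_intervention(cot_steps: List[str], fraction: float, intervention: str) -> List[str]:
    """Cut the CoT at a relative timestep and append the intervention text."""
    return truncate_at_fraction(cot_steps, fraction) + [intervention]


def build_continuation_prompt(question: str, perturbed_steps: List[str]) -> str:
    """Assemble a prompt from which the model would resume its reasoning.

    The "<think>" opening tag is an assumption; actual chat templates
    differ between models.
    """
    return question + "\n<think>\n" + "\n".join(perturbed_steps) + "\n"


if __name__ == "__main__":
    steps = [
        "Let x be the unknown.",
        "Then 2x + 3 = 11.",
        "So 2x = 8.",
        "Hence x = 4.",
    ]
    # Hypothetical intervention text, inserted halfway through the trace.
    perturbed = apply_intervention(steps, 0.5, "The weather is nice today.")
    print(build_continuation_prompt("Solve 2x + 3 = 11.", perturbed))
```

In an actual evaluation the returned prompt would be passed to the model under test, and the final answer compared with the answer from the unperturbed run. That generation step is omitted here because it depends on the serving stack.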
Peer Reviews
Decision: ICLR 2026 Poster
The dataset builds on a large number of existing benchmark datasets, spanning various science, math, and logic domains. Moreover, the paper experiments with a wide variety of reasoning models, including R1 distill, exaone, nemotron, phi, and QwQ. A notable discovery is that LRMs are generally robust to various types of intervention, whether benign, neutral, or adversarial. It is also notable that LRM generation length inflates the most when interrupted with random texts for recover
The largest concern is that the reasoning-trace interruption scenario is extremely far from realistic use cases: it assumes that the model's reasoning will be abruptly interrupted by external signals. Moreover, the interruptions are hardly meaningful in a logical or semantic sense, since they introduce noise that is completely irrelevant to the previous context and the model's generations. Rather than injecting random text from external sources, what if the model is injected with reasoning trace that l
- Efficiency of RMs is an important consideration. While it is not the explicit goal, I think the paper's results on efficiency are interesting and the dataset can help study efficiency further.
- Reasonable set of interventions and results across models.
- Ablations on why RMs are able to recover (using "wait"-like tokens) and how the interventions increase inference time per answer. I think this is the strongest part of the paper.
- Some of the results need more examination. For example, Table 5 shows that benign rewrites lead to drops up to 60%, but then Table 6 shows that there is no drop in CoT length across all intervention timesteps (except 0.9).
- There are interesting observations, but the insight is weak. For example, introducing wrong reasoning increases accuracy robustness and paraphrasing reduces accuracy. Surely, there are some confounding factors here that can be explored?
- Many of the results are "observati
1. By intervening directly in a model's own reasoning trace, the authors create a clean and realistic testbed for self-correction, which is a significant step forward in robustness evaluation.
2. The study is impressively broad, covering 9 models, 3 domains, 7 intervention types, and 5 timesteps. This thoroughness lends high credibility to the conclusions and suggests that the findings are generalizable.
3. The paper is well-written. The research question, methods, and results are communica
1. While identifying "doubt" is a major strength, the analysis relies on an LLM-based classifier to label sentences. This approach, while pragmatic, is somewhat superficial. The paper would be significantly strengthened by a deeper investigation into how "doubt" is represented internally. The activation analysis in the appendix is a good first step, but it feels underdeveloped. A more detailed analysis connecting specific internal states to the expression of doubt would benefit the contribution
Taxonomy
Topics: Advanced Graph Neural Networks · Explainable Artificial Intelligence (XAI) · Topic Modeling
