How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel, Mor Geva

TL;DR
This paper investigates how well reasoning models can identify and recover from unhelpful thoughts, revealing their limitations in self-reevaluation and the need for improved meta-cognitive abilities for safer AI systems.
Contribution
The study systematically evaluates reasoning models' ability to detect and recover from various unhelpful thoughts, highlighting their current shortcomings and non-intuitive scaling behaviors.
Findings
Models effectively identify most unhelpful thoughts.
Models struggle to recover once unhelpful thoughts are injected.
Larger models perform worse in recovering from irrelevant thoughts.
Abstract
Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases…
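The injection setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The `<think>` tag, the prompt layout, the type labels, and the helper names `inject_thought` and `accuracy_drop` are all assumptions made for illustration. The idea is to pre-fill the model's reasoning section with an unhelpful thought, let the model continue from there, and compare accuracy against runs without injection.

```python
from dataclasses import dataclass

# The four categories named in the abstract; these short labels are this sketch's own.
THOUGHT_TYPES = ("uninformative", "irrelevant", "misdirecting", "incorrect")


@dataclass
class InjectedPrompt:
    thought_type: str
    text: str


def inject_thought(question: str, thought: str, thought_type: str,
                   open_tag: str = "<think>") -> InjectedPrompt:
    """Build a prompt whose reasoning section is pre-filled with `thought`.

    The reasoning tag is left open so that a model would continue from the
    injected text. The tag format is an assumption, not taken from the paper.
    """
    if thought_type not in THOUGHT_TYPES:
        raise ValueError(f"unknown thought type: {thought_type!r}")
    text = f"{question.strip()}\n{open_tag}\n{thought.strip()}\n"
    return InjectedPrompt(thought_type, text)


def accuracy_drop(baseline_correct: list[bool],
                  injected_correct: list[bool]) -> float:
    """Return baseline accuracy minus accuracy under injection."""
    if len(baseline_correct) != len(injected_correct) or not baseline_correct:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(baseline_correct)
    return sum(baseline_correct) / n - sum(injected_correct) / n


if __name__ == "__main__":
    prompt = inject_thought("What is 2 + 3?", "Bananas are yellow.", "irrelevant")
    print(prompt.text)
    # Toy correctness flags, not results from the paper.
    print(accuracy_drop([True, True, True, False], [True, False, False, False]))
```

A positive value from `accuracy_drop` would indicate that the model failed to recover on some questions it otherwise answers correctly. The flags in the usage example are toy values, not reported results.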
Taxonomy
Topics: Decision-Making and Behavioral Economics · Mind wandering and attention · Optimism, Hope, and Well-being
