From Emergence to Control: Probing and Modulating Self-Reflection in Language Models

Xudong Zhu; Jiachen Jiang; Mohammad Mahdi Khalili; Zhihui Zhu

arXiv:2506.12217·cs.LG·June 17, 2025


Open Access · 1 Repo · 3 Reviews

TL;DR

This paper investigates the emergence and control of self-reflection in language models, revealing latent abilities in pretrained models and developing methods to enhance or suppress self-reflective reasoning for improved performance and efficiency.

Contribution

It introduces Reflection-Inducing Probing to reveal latent self-reflection in pretrained models and constructs a self-reflection vector for bidirectional control over reasoning behavior.
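
As a rough illustration of what a self-reflection vector with bidirectional control could look like, the sketch below assumes a difference-in-means construction and a scalar strength alpha, both of which are named in the peer reviews on this page. The function names, the toy two-dimensional activations, and the use of plain Python lists are assumptions made for illustration only; the official repository is listed as PyTorch, and the layer, token positions, and scaling the authors actually use are not stated here.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def reflection_vector(reflective: List[Vector],
                      non_reflective: List[Vector]) -> Vector:
    """Difference-in-means direction: mean(reflective) - mean(non_reflective).

    Assumption: each row is a hidden state collected at a reflective or
    non-reflective token; how such states are gathered is not shown here.
    """
    mu_pos = mean_vector(reflective)
    mu_neg = mean_vector(non_reflective)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the direction.

    alpha > 0 pushes toward reflection, alpha < 0 pushes away from it.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical values, not taken from the paper).
reflective = [[1.0, 2.0], [3.0, 4.0]]
non_reflective = [[0.0, 1.0], [2.0, 1.0]]

v = reflection_vector(reflective, non_reflective)
enhanced = steer([0.0, 0.0], v, alpha=0.5)
suppressed = steer([0.0, 0.0], v, alpha=-0.5)
```

With the same direction, only the sign of alpha changes between the enhancing and suppressing cases, which is one simple way to read "bidirectional control".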

Findings

01 Self-reflection exists in pretrained models at low frequency.

02 Manipulating the self-reflection vector can improve reasoning accuracy.

03 Controlling self-reflection reduces computational costs.

Abstract

Self-reflection, the ability of a large language model (LLM) to revisit, evaluate, and revise its own reasoning, has recently emerged as a powerful behavior enabled by reinforcement learning with verifiable rewards (RLVR). While self-reflection correlates with improved reasoning accuracy, its origin and underlying mechanisms remain poorly understood. In this work, we first show that self-reflection is not exclusive to RLVR fine-tuned models: it already emerges, albeit rarely, in pretrained models. To probe this latent ability, we introduce Reflection-Inducing Probing, a method that injects reflection-triggering reasoning traces from fine-tuned models into pretrained models. This intervention raises self-reflection frequency of Qwen2.5 from 0.6% to 18.6%, revealing a hidden capacity for reflection. Moreover, our analysis of internal representations shows that both pretrained…
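
To make the notion of a self-reflection frequency concrete, here is a minimal sketch that counts the fraction of responses containing a reflection marker. It assumes frequency is measured per response and that reflection is detected by a whole-word, case-insensitive match on a marker such as "wait" (the token the reviews on this page discuss). Both choices, along with the function name and the sample responses, are assumptions for illustration; the paper's actual detection criterion is not given in this excerpt.

```python
import re
from typing import Iterable, Sequence


def reflection_frequency(responses: Sequence[str],
                         markers: Iterable[str] = ("wait",)) -> float:
    """Fraction of responses containing at least one marker as a whole word."""
    if not responses:
        return 0.0
    # Whole-word, case-insensitive match, so "waiting" does not count as "wait".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(m) for m in markers) + r")\b",
        re.IGNORECASE,
    )
    hits = sum(1 for text in responses if pattern.search(text))
    return hits / len(responses)


# Hypothetical sample responses, not taken from the paper.
samples = [
    "The answer is 12. Wait, let me recheck the sum.",
    "The answer is 7.",
    "We get x = 3.",
    "I was waiting for the result: 5.",
]

freq = reflection_frequency(samples)
freq_wider = reflection_frequency(samples, markers=("wait", "recheck"))
```

A surface-form detector like this is sensitive to the marker set: a marker can appear in non-reflective contexts, and reflective text may use other words, so any frequency it reports depends on that choice.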

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

S1. The paper presents an interesting and clear visualization of token-level separation using UMAP, highlighting distinct patterns between reflective and non-reflective representations.
S2. The study is guided by good motivation, making the problem setup and research goal easy to understand.

Weaknesses

W1. Line 363 overstates the generality of the token-level correlation. Showing that one model (DeepSeek) emits “wait”-style tokens associated with reflection does not imply that other models share the same surface-form pattern. Testing the setup on additional model families (e.g., LLaMA and Qwen), or explicitly acknowledging that this observation is model-specific, would improve the quality of the paper. Moreover, prior work (e.g., [2]) suggests that explicit markers like “Wait” or “Hmm” can increase generation leng

Reviewer 02 · Rating 6 · Confidence 4

Strengths

I really like the main idea of the paper. It is (to my knowledge) innovative, intuitive and actionable. The experimental design is well thought out, and the paper is overall easy to read.

Weaknesses

I have some issues with the focus on the single token "wait," as an indicator of self-reflection. This token could also appear in other contexts. Given the tiny amount of self-reflection in the base model, this could have a substantial impact on a key finding of the paper, i.e., the emergence of self-reflection. I would urge the authors not only to extend the set of respective tokens but also to quantify which sets of tokens are used by the base model. From the appendi

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- The analysis of model states and the visualizations on both models, showing clear separation between self-reflection and non-self-reflection states, are very interesting.
- The paper presents a good ablation study on the effect of the linear intervention method by varying the parameter alpha.
- It is interesting to see that negative alpha does reduce the size of the response, and increasing alpha increases it. I don’t understand the reason behind the authors' claim that higher output leng

Weaknesses

Scope for improvement:
- The claim that an LLM acquires the ability to reflect upon its responses with “Wait” or similar words during pre-training is not a new finding; it has been studied extensively in prior literature. Therefore, I think the paper has limited novelty.
- The authors should shed some light on why they used difference-in-means to construct the self-reflection vector. I am curious what other metrics they evaluated before settling on using the mean. I see the ablations on comparing differenc

Code & Models

Repositories

xzascc/probingreflection
PyTorch · Official

Taxonomy

Topics: Topic Modeling