Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Zhenyu Zhang; Shujian Zhang; John Lambert; Wenxuan Zhou; Zhangyang Wang; Mingqing Chen; Andrew Hard; Rajiv Mathews; Lun Wang

arXiv:2512.23988·cs.CL·January 1, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces RISE, an unsupervised auto-encoder framework that uncovers and controls diverse reasoning behaviors in large language models by analyzing activation space, enabling interpretability and manipulation of reasoning processes.

Contribution

The work presents a novel unsupervised method using sparse auto-encoders to discover and control reasoning behaviors in LLMs without human-defined concepts.

Findings

01

Disentangles reasoning behaviors like reflection and backtracking in activation space

02

Enables controllable amplification or suppression of specific reasoning behaviors

03

Discovers novel reasoning behaviors beyond human supervision
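The amplification and suppression described in the second finding can be pictured as adding a scaled direction to a hidden state. The following is a minimal sketch under that assumption only; the function names `steer` and `projection`, the scale `alpha`, and the random vectors are hypothetical and are not taken from the paper.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to a hidden state.

    alpha > 0 is meant to amplify the behavior, alpha < 0 to suppress it.
    Additive steering is an assumption of this sketch.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def projection(hidden, direction):
    """Scalar component of the hidden state along the direction."""
    unit = direction / np.linalg.norm(direction)
    return float(hidden @ unit)

rng = np.random.default_rng(0)
hidden = rng.normal(size=8)        # placeholder for one step-level activation
direction = rng.normal(size=8)     # placeholder for one reasoning vector

amplified = steer(hidden, direction, alpha=2.0)
suppressed = steer(hidden, direction, alpha=-2.0)

# The component along the direction moves by exactly alpha.
shift_up = projection(amplified, direction) - projection(hidden, direction)
shift_down = projection(suppressed, direction) - projection(hidden, direction)
```

Normalizing the direction makes `alpha` directly comparable across vectors of different norms, which is a design choice of this sketch rather than something the summary states.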

Abstract

Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled…
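The pipeline named in the abstract (sentence-level steps, then a sparse auto-encoder on step-level activations) can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the regex sentence splitter, the ReLU encoder with a linear decoder, the L1 sparsity penalty, and the random placeholder activations. The abstract does not specify the architecture or the loss.

```python
import re
import numpy as np

def split_into_steps(trace):
    """Split a chain-of-thought trace into sentence-level steps (naive regex rule)."""
    parts = re.split(r"(?<=[.!?])\s+", trace.strip())
    return [p for p in parts if p]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """ReLU encoder followed by a linear decoder."""
    z = np.maximum(0.0, x @ W_enc + b_enc)   # non-negative codes, shape (n, m)
    x_hat = z @ W_dec + b_dec                # reconstruction, shape (n, d)
    return z, x_hat

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the codes."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return float(recon + l1_coeff * sparsity)

rng = np.random.default_rng(0)
d, m = 16, 64                                # activation width, dictionary size

steps = split_into_steps("Let x = 2. Wait, that is wrong! Try x = 3 instead.")
# Placeholder: in practice each row would be an activation for one step.
x = rng.normal(size=(len(steps), d))

W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))
b_dec = np.zeros(d)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, z)
```

In this layout each row of `W_dec` is one dictionary direction in activation space, which is the kind of object the abstract calls a reasoning vector. How the paper extracts step-level activations from the model is not stated in this excerpt, so that part is left as a placeholder.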

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Unsupervised SAEs on step-level activations reveal interpretable reasoning vectors and enable causal behavior steering without retraining.
2. The method RISE is coherent with plausible identifiability arguments and consistent cross-benchmark evidence supporting the main claims.
3. The findings are valuable for interpretability and practical control of LLM reasoning. The paper presents a clean end-to-end pipeline with intuitive figures and explanations that make replication straightforward.

Weaknesses

1. The behavior labels rely on an LLM-as-judge, but the paper does not demonstrate robustness to the choice of judge model or to prompt wording.
2. Composing multiple behavior vectors may introduce interference, because the vectors are not orthogonal.
3. The identifiability claim for RISE depends on sparsity and incoherence assumptions, which the paper does not validate empirically on real activations.
4. The experiments are confined to math reasoning, which raises concerns about domain generality.
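The interference concern in the second weakness can be made concrete with a small calculation: when two directions are not orthogonal, steering along one also shifts the component along the other by the steering strength times their cosine similarity. The vectors and names below (`v_reflect`, `v_backtrack`, `cross_talk`) are hypothetical toy values, not quantities from the paper.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(unit(a) @ unit(b))

def cross_talk(alpha, v_target, v_other):
    """Shift along v_other caused by adding alpha * unit(v_target)."""
    return alpha * cosine(v_target, v_other)

# Toy 2-D directions at 45 degrees to each other (hypothetical values).
v_reflect = np.array([1.0, 0.0])
v_backtrack = np.array([1.0, 1.0])

h = np.zeros(2)
h_steered = h + 2.0 * unit(v_backtrack)       # steer only along "backtracking"
leak = float(h_steered @ unit(v_reflect))     # unintended shift along "reflection"
predicted = cross_talk(2.0, v_backtrack, v_reflect)
```

If the cosine between two decoder directions is near zero, composing them is close to independent; the larger it is, the more one intervention bleeds into the other.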

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is well-structured, logical, and clearly written, making the complex concepts easy to follow.
2. The core idea of using SAEs to analyze the reasoning patterns of LLMs is novel and interesting. The demonstration that manipulating these disentangled features leads to a clear and observable enhancement or suppression of the corresponding reasoning behaviors is a significant strength.
3. The experiment to identify "confident reasoning vectors" by optimizing for entropy in the SAE decode

Weaknesses

- In Figure 2, the separation between the *Reflection* and *Backtracking* clusters does not appear very distinct, and the normalized Silhouette scores in Figure 3 peak at only around 0.6, which is close to the boundary of meaningful separation. It would be helpful to clarify whether the clustering in the decoder space can be considered sufficiently reliable under these conditions.
- In Section 4.5, the paper suggests that cluster formation in the mid-to-late layers reflects response length, but

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. This paper proposes an unsupervised framework for interpreting reasoning behavior, which requires no human intervention and can intuitively reflect the inference vectors during model reasoning.
2. This paper investigates the model's reasoning behavior through ample experiments. The results suggest that reflection and backtracking unraveling functions occupy separable regions in the decoder column space. Furthermore, the paper analyzes several factors influencing the model's reasoning behavio

Weaknesses

1. Limited Domain Focus: The experimental section of the paper only used the Mathematics dataset (MATH) to analyze mathematical reasoning behavior. Adding reasoning tasks from other domains, such as common-sense reasoning and logical reasoning, might strengthen the experiment's persuasiveness.
2. In the experiments in Section 4.3, regarding the experimental results in Figure 3, the paper explains the statement that "the comparison between reflection and backtracking shows only moderate differen

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)