Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang, Mingqing Chen, Andrew Hard, Rajiv Mathews, Lun Wang

TL;DR
This paper introduces RISE, an unsupervised auto-encoder framework that uncovers and controls diverse reasoning behaviors in large language models by analyzing activation space, enabling interpretability and manipulation of reasoning processes.
Contribution
The work presents a novel unsupervised method using sparse auto-encoders to discover and control reasoning behaviors in LLMs without human-defined concepts.
Findings
Disentangles reasoning behaviors like reflection and backtracking in activation space
Enables controllable amplification or suppression of specific reasoning behaviors
Discovers novel reasoning behaviors beyond human supervision
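The amplification and suppression listed above is commonly done by activation steering, where a scaled direction is added to a hidden state. The following is a minimal, dependency-free sketch of that general idea, not the paper's confirmed intervention. The hidden state `h`, the reasoning vector `v` (for example, one SAE decoder column), and the strength `alpha` are all hypothetical values chosen for illustration.

```python
import math

def unit(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def projection(h, v):
    """Scalar projection of h onto the direction of v."""
    return sum(a * b for a, b in zip(h, unit(v)))

def steer(h, v, alpha):
    """Add alpha * unit(v) to h; alpha > 0 amplifies, alpha < 0 suppresses."""
    return [a + alpha * b for a, b in zip(h, unit(v))]

# Hypothetical toy values, not taken from the paper.
h = [0.5, -1.0, 2.0]   # assumed hidden state at one reasoning step
v = [3.0, 0.0, 4.0]    # assumed reasoning vector
amplified = steer(h, v, 2.0)
suppressed = steer(h, v, -2.0)
```

Because the direction is normalized, the projection of the hidden state onto `v` shifts by exactly `alpha`, which makes the steering strength easy to reason about.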
Abstract
Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled…
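The pipeline described in the abstract (split a chain-of-thought trace into sentence-level steps, then fit a sparse auto-encoder on one activation vector per step) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions: the regex-based splitter, the ReLU encoder, the mean-squared-error plus L1 objective, and every name and dimension here are hypothetical and not confirmed by the source. Training and the extraction of real model activations are omitted.

```python
import random
import re

def split_steps(trace):
    """Split a chain-of-thought trace into sentence-level steps (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", trace.strip())
    return [p for p in parts if p]

def matvec(W, x):
    """Multiply a row-major matrix W by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class SparseAutoencoder:
    """Untrained SAE: z = ReLU(W_enc x + b_enc), x_hat = W_dec z + b_dec."""

    def __init__(self, d_model, d_latent, seed=0):
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_latent)]
        self.b_enc = [0.0] * d_latent
        # Column j of W_dec is the candidate direction for latent j.
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_latent)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, z):
        return [r + b for r, b in zip(matvec(self.W_dec, z), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Reconstruction MSE plus an L1 sparsity penalty (assumed objective)."""
        z = self.encode(x)
        x_hat = self.decode(z)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        return mse + l1_coeff * sum(z)  # z is non-negative, so sum(z) is its L1 norm

steps = split_steps("First, expand the square. Wait, let me check that again. "
                    "The answer is 4.")
sae = SparseAutoencoder(d_model=4, d_latent=8)
# Stand-in for a real step-level activation (hypothetical values).
x = [0.2, -0.1, 0.4, 0.0]
z = sae.encode(x)
```

In a real setting, `x` would come from the model's hidden states at each step, and the decoder columns of a trained SAE would be the candidate reasoning vectors.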
Peer Reviews
Decision · Submitted to ICLR 2026
1. Unsupervised SAEs on step-level activations reveal interpretable reasoning vectors and enable causal behavior steering without retraining. 2. The method RISE is coherent with plausible identifiability arguments and consistent cross-benchmark evidence supporting the main claims. 3. The findings are valuable for interpretability and practical control of LLM reasoning. The paper presents a clean end-to-end pipeline with intuitive figures and explanations that make replication straightforward.
1. The behavior labels rely on an LLM-as-judge, but the paper does not demonstrate robustness to the choice of judge model or to prompt wording. 2. Composing multiple behavior vectors may introduce interference because the vectors are not orthogonal. 3. RISE's identifiability claim depends on sparsity and incoherence assumptions, which are not empirically validated on real activations in this paper. 4. The experiments are confined to math reasoning, which raises concerns about domain generality.
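The non-orthogonality and incoherence concerns above can be checked empirically by measuring pairwise cosine similarity between behavior vectors. The sketch below computes the mutual coherence, meaning the largest absolute cosine between any two distinct vectors. The three toy vectors and their names are hypothetical placeholders, not values from the paper.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (nu * nv)

def mutual_coherence(vectors):
    """Largest absolute cosine similarity over all distinct pairs."""
    return max(abs(cosine(u, v)) for u, v in combinations(vectors, 2))

# Hypothetical behavior vectors for illustration only.
reflection = [1.0, 0.0, 0.0]
backtracking = [1.0, 1.0, 0.0]
confidence = [0.0, 0.0, 1.0]

coherence = mutual_coherence([reflection, backtracking, confidence])
```

A value near 0 indicates nearly orthogonal vectors, while a value near 1 suggests that steering along one vector will also move the hidden state along another.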
1. The paper is well-structured, logical, and clearly written, making the complex concepts easy to follow. 2. The core idea of using SAEs to analyze the reasoning patterns of LLMs is novel and interesting. The demonstration that manipulating these disentangled features leads to a clear and observable enhancement or suppression of the corresponding reasoning behaviors is a significant strength. 3. The experiment to identify "confident reasoning vectors" by optimizing for entropy in the SAE decode
- In Figure 2, the separation between the *Reflection* and *Backtracking* clusters does not appear very distinct, and the normalized Silhouette scores in Figure 3 peak at only around 0.6, which is close to the boundary of meaningful separation. It would be helpful to clarify whether the clustering in the decoder space can be considered sufficiently reliable under these conditions.
- In Section 4.5, the paper suggests that cluster formation in the mid-to-late layers reflects response length, but
1. This paper proposes an unsupervised framework for interpreting reasoning behavior, which requires no human intervention and can intuitively reveal the reasoning vectors at work during model reasoning. 2. This paper investigates the model's reasoning behavior through extensive experiments. The results suggest that the disentangled reflection and backtracking features occupy separable regions in the decoder column space. Furthermore, the paper analyzes several factors influencing the model's reasoning behavio
1. Limited Domain Focus: The experimental section of the paper only used the Mathematics dataset (MATH) to analyze mathematical reasoning behavior. Adding reasoning tasks from other domains, such as common-sense reasoning and logical reasoning, might strengthen the experiment's persuasiveness. 2. In the experiments in Section 4.3, regarding the experimental results in Figure 3, the paper explains the statement that "the comparison between reflection and backtracking shows only moderate differen
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
