Step-Level Sparse Autoencoder for Reasoning Process Interpretation
Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

TL;DR
This paper introduces a step-level sparse autoencoder (SSAE) to interpret LLM reasoning processes by disentangling step features, enabling analysis of reasoning direction, semantic transitions, and properties like correctness and logicality.
Contribution
The proposed SSAE captures step-level reasoning features with controlled sparsity, improving interpretability of LLMs' reasoning steps beyond token-level analysis.
Findings
Extracted features predict reasoning correctness and logicality
LLMs partly encode properties like generation length during reasoning
SSAE enhances understanding of LLMs' reasoning process
Abstract
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it…
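The bottleneck idea in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the top-k sparsity rule, the linear encoder and decoder, the vector sizes, and every name here (ToyStepSAE, topk_sparsify, W_enc, W_dec, W_ctx) are assumptions chosen only for illustration. It shows a step vector encoded into a feature vector with a fixed number of active entries, and a decoder that receives the context separately, so the sparse features only need to carry what the context does not already supply.

```python
import numpy as np


def topk_sparsify(z, k):
    """Keep the k largest-magnitude entries of z and zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out


class ToyStepSAE:
    """Hypothetical step-level sparse autoencoder with untrained random weights."""

    def __init__(self, d_model, d_feat, k, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k  # assumed sparsity budget: number of active features per step
        self.W_enc = rng.normal(scale=0.1, size=(d_feat, 2 * d_model))
        self.W_dec = rng.normal(scale=0.1, size=(d_model, d_feat))
        self.W_ctx = rng.normal(scale=0.1, size=(d_model, d_model))

    def encode(self, context_vec, step_vec):
        # The encoder sees both context and step, then is forced through top-k.
        pre = self.W_enc @ np.concatenate([context_vec, step_vec])
        return topk_sparsify(pre, self.k)

    def decode(self, context_vec, features):
        # The decoder gets the context for free; the sparse features add the rest.
        return self.W_ctx @ context_vec + self.W_dec @ features


sae = ToyStepSAE(d_model=16, d_feat=64, k=4)
rng = np.random.default_rng(1)
context, step = rng.normal(size=16), rng.normal(size=16)
features = sae.encode(context, step)
recon = sae.decode(context, features)
print(np.count_nonzero(features), recon.shape)
```

In a trained version of such a model, the weights would be fit to minimize the step reconstruction error, and lowering k would tighten the bottleneck; the sketch only demonstrates the data flow.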
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
