TL;DR
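This paper investigates the interpretability of latent reasoning models (LRMs). It finds that LRMs often reach the same answers without using their latent reasoning tokens, and that when the tokens are used, natural language reasoning traces can often be decoded from them, suggesting that the models encode interpretable processes.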
Contribution
The study provides empirical evidence that latent reasoning tokens are often unnecessary for a model's predictions, yet when they are needed, they can be decoded into natural language reasoning traces, which aids interpretability analysis.
Findings
On logical reasoning datasets, LRMs can almost always produce the same final answers without using latent reasoning tokens.
Decoding gold reasoning traces is possible for 65-93% of correct predictions.
Verified reasoning traces can be decoded for most correct, but few incorrect, predictions.
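The first finding rests on comparing a model's final answers with and without latent reasoning. A minimal sketch of such a comparison follows, assuming the two sets of final answers are already available as aligned lists of strings. The function name `agreement_rate` and the toy answers are hypothetical illustrations, not the paper's code or results.

```python
from typing import Sequence


def agreement_rate(with_latent: Sequence[str], without_latent: Sequence[str]) -> float:
    """Fraction of examples whose final answer is unchanged when latent reasoning is skipped."""
    if len(with_latent) != len(without_latent):
        raise ValueError("answer lists must be the same length")
    if not with_latent:
        raise ValueError("need at least one example")
    # Count positions where both runs give exactly the same final answer.
    same = sum(a == b for a, b in zip(with_latent, without_latent))
    return same / len(with_latent)


# Hypothetical toy answers (assumption for illustration, not data from the paper).
answers_with = ["True", "False", "True", "True"]
answers_without = ["True", "False", "False", "True"]
rate = agreement_rate(answers_with, answers_without)
print(f"answer agreement: {rate:.2f}")
```

A high agreement rate under this kind of check would indicate that the latent tokens contribute little to the final answer on that dataset. Exact string matching is a simplification; a real evaluation may need answer normalization.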
Abstract
Latent reasoning models (LRMs) have attracted significant research interest due to their low inference cost (relative to explicit reasoning models) and theoretical ability to explore multiple reasoning paths in parallel. However, these benefits come at the cost of reduced interpretability: LRMs are difficult to monitor because they do not reason in natural language. This paper presents an investigation into LRM interpretability by examining two state-of-the-art LRMs. First, we find that latent reasoning tokens are often unnecessary for LRMs' predictions; on logical reasoning datasets, LRMs can almost always produce the same final answers without using latent reasoning at all. This underutilization of reasoning tokens may partially explain why LRMs do not consistently outperform explicit reasoning methods and raises doubts about the stated role of these tokens in prior work. Second, we…
