Are Latent Reasoning Models Easily Interpretable?

Connor Dilgren; Sarah Wiegreffe

arXiv:2604.04902 · cs.LG · April 7, 2026

TL;DR

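This paper investigates the interpretability of latent reasoning models (LRMs). It finds that LRMs often reach their answers without relying on their latent reasoning tokens, and that when those tokens are used, they can often be decoded into natural-language reasoning traces, suggesting that LRMs encode interpretable processes.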

Contribution

The study provides empirical evidence that latent reasoning tokens are often unnecessary for LRMs' predictions and that, when they are needed, they can be decoded into natural-language reasoning traces, which makes the models' reasoning easier to inspect.

Findings

01. LRMs often produce correct answers without using reasoning tokens.

02. Decoding gold reasoning traces is possible for 65-93% of correct predictions.

03. Verified reasoning traces can be decoded for most correct, but few incorrect, predictions.

Abstract

Latent reasoning models (LRMs) have attracted significant research interest due to their low inference cost (relative to explicit reasoning models) and theoretical ability to explore multiple reasoning paths in parallel. However, these benefits come at the cost of reduced interpretability: LRMs are difficult to monitor because they do not reason in natural language. This paper presents an investigation into LRM interpretability by examining two state-of-the-art LRMs. First, we find that latent reasoning tokens are often unnecessary for LRMs' predictions; on logical reasoning datasets, LRMs can almost always produce the same final answers without using latent reasoning at all. This underutilization of reasoning tokens may partially explain why LRMs do not consistently outperform explicit reasoning methods and raises doubts about the stated role of these tokens in prior work. Second, we…
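The first finding in the abstract (final answers are almost always unchanged when latent reasoning is removed) amounts to an agreement rate between two sets of predictions. The sketch below shows one way such a rate could be computed. The function name `answer_agreement_rate`, the string-matching rule, and the toy predictions are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import Sequence


def answer_agreement_rate(
    with_latent: Sequence[str], without_latent: Sequence[str]
) -> float:
    """Fraction of examples whose final answer is identical with and
    without latent reasoning (hypothetical helper, exact string match
    after trimming whitespace)."""
    if len(with_latent) != len(without_latent):
        raise ValueError("prediction lists must have the same length")
    if not with_latent:
        raise ValueError("need at least one example")
    same = sum(a.strip() == b.strip() for a, b in zip(with_latent, without_latent))
    return same / len(with_latent)


# Toy, made-up predictions for four examples (not results from the paper).
with_latent = ["True", "False", "True", "True"]
without_latent = ["True", "False", "False", "True"]

rate = answer_agreement_rate(with_latent, without_latent)
print(rate)  # 3 of 4 answers match -> 0.75
```

A rate close to 1.0 on a dataset would indicate that the latent reasoning tokens contribute little to the final answers there, which is the pattern the abstract reports for logical reasoning datasets.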

Code & Models

Repositories

connordilgren/are-lrms-easily-interpretable (GitHub)
