Reasoning Models Sometimes Output Illegible Chains of Thought
Arun Jose

TL;DR
This paper investigates how reinforcement learning affects the legibility of chain-of-thought reasoning in language models. It finds that models often rely on illegible reasoning to reach correct answers, which complicates efforts to monitor them.
Contribution
It provides the first comprehensive analysis of reasoning legibility across 14 reasoning models, highlighting the impact of RL on reasoning transparency and its implications for AI safety.
Findings
RL often causes illegible reasoning in models
Illegible reasoning persists even when answers are readable
Legibility decreases on more difficult questions
Abstract
Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs are legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling, suggesting the relationship is more nuanced. We also find that legibility degrades on…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Ethics and Social Impacts of AI
