Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
Zidi Xiong, Shan Chen, Himabindu Lakkaraju

TL;DR
This paper investigates how monitorability, the degree to which chain-of-thought traces reflect a model's internal reasoning, can improve spontaneously during RLVR training. The effect depends on data diversity and training dynamics, and it is not necessarily linked to reasoning performance.
Contribution
The paper provides a systematic evaluation of how monitorability emerges during RLVR, highlighting its dependence on training data, its independence from reasoning capability, and mechanistic insights into the factors behind it.
Findings
Monitorability improvements are data-dependent.
Data diversity and instruction-following data during RLVR training are critical to these improvements.
Monitorability is orthogonal to reasoning performance.
Abstract
As Large Reasoning Models (LRMs) are increasingly deployed, auditing their chain-of-thought (CoT) traces for safety becomes critical. Recent work has reported that monitorability--the degree to which CoT faithfully and informatively reflects internal computation--can appear as a "free gift" during the early stages of Reinforcement Learning with Verifiable Rewards (RLVR). We make this observation concrete through a systematic evaluation across model families and training domains. Our results show that this effect is not universal: monitorability improvements are strongly data-dependent. In particular, we demonstrate the critical role of data diversity and instruction-following data during RLVR training. We further show that monitorability is orthogonal to capability--improvements in reasoning performance do not imply increased transparency. Through mechanistic analysis, we attribute…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning
