Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning
Qinan Yu, Alexa Tartaglini, Peter Hase, Carlos Guestrin, Christopher Potts

TL;DR
This paper examines whether reinforcement learning with verifiable rewards (RLVR) actually leads models to produce causally important reasoning, and proposes two metrics and auxiliary rewards for measuring and improving reasoning quality.
Contribution
It introduces two metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), and demonstrates how auxiliary rewards based on them can improve reasoning in language models.
Findings
RLVR improves task accuracy but does not reliably improve CIR or SR.
A small amount of supervised fine-tuning (SFT) before RLVR can improve the reasoning metrics.
Auxiliary CIR/SR rewards can match RLVR accuracy while enhancing reasoning.
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for…
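To make the two metrics more concrete, here is a minimal Python sketch of what proxies in the spirit of CIR and SR could look like. It is not the paper's definition of either metric: the abstract gives no formulas, so the masking scheme, the averaging, and every name below (`cir_proxy`, `sr_proxy`, `MASK`, `toy_answer_prob`, `toy_verifier`) are assumptions made for illustration. The toy model and toy verifier stand in for a real language model and a real verifier.

```python
import re
from typing import Callable, List, Sequence

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def cir_proxy(reasoning: Sequence[str],
              answer_prob: Callable[[Sequence[str]], float]) -> float:
    """Assumed proxy for causal importance: the mean absolute change in the
    probability of the final answer when each reasoning token is masked in turn."""
    if not reasoning:
        return 0.0
    base = answer_prob(reasoning)
    total = 0.0
    for i in range(len(reasoning)):
        ablated = list(reasoning)
        ablated[i] = MASK  # ablate one token, keep the rest intact
        total += abs(base - answer_prob(ablated))
    return total / len(reasoning)


def sr_proxy(reasoning: str, verifier: Callable[[str], List[str]]) -> bool:
    """Assumed proxy for sufficiency: True when the verifier derives exactly
    one distinct candidate answer from the reasoning text alone."""
    return len(set(verifier(reasoning))) == 1


# Toy stand-ins (hypothetical): the "model" is confident only if "42" appears
# in its reasoning, and the "verifier" treats every number as a candidate answer.
def toy_answer_prob(tokens: Sequence[str]) -> float:
    return 0.9 if "42" in tokens else 0.1


def toy_verifier(reasoning: str) -> List[str]:
    return re.findall(r"\d+", reasoning)


if __name__ == "__main__":
    print(cir_proxy(["so", "x", "=", "42"], toy_answer_prob))
    print(sr_proxy("so x = 42", toy_verifier))
    print(sr_proxy("maybe 41 or 42", toy_verifier))
```

In this toy setup only the token `42` changes the answer probability when masked, so the importance score is driven entirely by that one token, and a reasoning string that mentions two different numbers counts as insufficient because the verifier cannot settle on a single answer.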
