Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

Qinan Yu; Alexa Tartaglini; Peter Hase; Carlos Guestrin; Christopher Potts

arXiv:2604.22074 · cs.CL · April 27, 2026

TL;DR

This paper examines whether reinforcement learning from verifiable rewards (RLVR) actually makes a model's chain-of-thought reasoning causally important to, and sufficient for, its final answer. It proposes two metrics for measuring this and auxiliary rewards for improving it.

Contribution

It introduces two metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), and shows that auxiliary rewards based on them can improve the reasoning of language models.

Findings

01. RLVR improves task accuracy but does not reliably improve CIR or SR.

02. A small amount of supervised fine-tuning (SFT) before RLVR can improve the reasoning metrics.

03. Auxiliary CIR/SR rewards can match RLVR accuracy while improving reasoning.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for…
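To make the two metric ideas in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's definition of CIR or SR, which the abstract gives only in words. The functions `toy_cir` and `toy_sr`, the `[MASK]` placeholder, and both stubs are hypothetical and exist only for illustration. The sketch assumes that "cumulative effect" can be approximated by masking one reasoning token at a time and summing the change in answer probability, and that "unambiguous" means the verifier extracts exactly one candidate answer from the reasoning.

```python
import re
from typing import Callable, Sequence, Set

MASK = "[MASK]"  # hypothetical placeholder used to ablate a reasoning token


def toy_cir(tokens: Sequence[str],
            answer_prob: Callable[[Sequence[str]], float]) -> float:
    """Toy proxy for causal importance (an assumption, not the paper's formula).

    Masks each reasoning token in turn and sums the absolute change in the
    probability that the model assigns to its original answer.
    """
    base = answer_prob(tokens)
    total = 0.0
    for i in range(len(tokens)):
        ablated = list(tokens)
        ablated[i] = MASK
        total += abs(base - answer_prob(ablated))
    return total


def toy_sr(reasoning: str, verifier: Callable[[str], Set[str]]) -> bool:
    """Toy proxy for sufficiency (an assumption, not the paper's formula).

    The reasoning counts as sufficient when the verifier, seeing only the
    reasoning, extracts exactly one candidate answer.
    """
    return len(verifier(reasoning)) == 1


def stub_answer_prob(tokens: Sequence[str]) -> float:
    """Stand-in for a model: confident in the answer only if '7' is present."""
    return 0.9 if "7" in tokens else 0.1


def stub_verifier(reasoning: str) -> Set[str]:
    """Stand-in for a verifier: collects every integer that follows '='."""
    return set(re.findall(r"=\s*(-?\d+)", reasoning))


if __name__ == "__main__":
    print(toy_cir(["3", "+", "4", "=", "7"], stub_answer_prob))
    print(toy_sr("3 + 4 = 7", stub_verifier))
    print(toy_sr("x = 7, or maybe x = 8", stub_verifier))
```

In practice the two stubs would be replaced by calls to an actual language model and an actual verifier. The design point of the sketch is only that importance is measured by intervening on the reasoning, while sufficiency is measured by hiding everything except the reasoning.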
