Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer

TL;DR
This paper shows that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even when the rewards are spurious and non-informative, which makes validation across model families essential.
Contribution
It demonstrates that spurious rewards can produce significant performance gains in certain models, attributes this to a clipping bias from the clip term in GRPO, and challenges assumptions about reward informativeness in RLVR.
Findings
Randomly assigned rewards improve Qwen2.5-Math-7B on MATH-500 by 21.4 percentage points, nearly matching the 29.1-point gain from ground-truth rewards.
Clipping bias amplifies high-prior behaviors learned during pretraining.
Effectiveness of spurious rewards varies across different model families.
Abstract
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning -- reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious…
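The clip term named in the abstract can be illustrated with a minimal sketch of a PPO-style clipped surrogate using group-normalized advantages and randomly assigned rewards. This is not the authors' code: the function names, the clip range of 0.2, and the Bernoulli reward probability of 0.5 are all assumptions made for illustration, and the sketch does not reproduce the paper's bias analysis. It only shows where the clip term zeroes the gradient.

```python
import numpy as np

def random_rewards(n, p=0.5, seed=0):
    """Assign 0/1 rewards at random, ignoring the responses (p is an assumed value)."""
    rng = np.random.default_rng(seed)
    return rng.binomial(1, p, size=n).astype(float)

def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)

def surrogate_grad_wrt_ratio(ratio, advantage, clip_eps=0.2):
    """Gradient of the clipped objective with respect to the probability ratio.

    The gradient is zero when the ratio has already moved past the clip range
    in the direction the advantage favors; otherwise it equals the advantage.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped_high = (advantage > 0) & (ratio > 1.0 + clip_eps)
    clipped_low = (advantage < 0) & (ratio < 1.0 - clip_eps)
    return np.where(clipped_high | clipped_low, 0.0, advantage)

if __name__ == "__main__":
    rewards = random_rewards(8)
    advantages = group_normalized_advantages(rewards)
    ratios = np.linspace(0.6, 1.4, 8)  # hypothetical new/old policy probability ratios
    print(surrogate_grad_wrt_ratio(ratios, advantages))
```

With random rewards the advantages average to zero within a group, so any systematic effect on the policy has to come from how the clip term treats ratios above and below the clip range, which is the asymmetry the abstract refers to.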
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
**[S1] Novel and impactful findings** The paper presents a striking discovery that RLVR can produce substantial performance improvements even with spurious rewards (Qwen). Furthermore, the insight that RLVR methods should be validated across diverse model families rather than relying on a single architecture is a valuable contribution that will benefit the broader research community in avoiding potentially misleading conclusions. **[S2] Rigorous experimental design and analysis** The
**[W1]** While the lack of access to pretraining data makes deeper investigation challenging and understandable, the paper's reliance on phenomenon discovery and analysis through one representative behavior (code reasoning) leaves some mechanistic questions unanswered. Although the authors reasonably acknowledge this as future work, a deeper mechanistic explanation would have strengthened the contribution. **[W2]** In Section 5 (Curious case), the authors provide additional analysis that su
1. The paper conducts a very detailed set of experiments across different model families, including Qwen and Llama, which allows it to comprehensively demonstrate the scope of its conclusions, mainly applicable to the Qwen series.
2. In the final part, the paper provides an algorithmic analysis that explains its findings from the optimization perspective.
1. The experimental results in this paper lack sufficient justification. In Appendix I, the authors show Qwen2.5-Math-7B base model scores under different prompts: the best prompt (from SimpleRL-Zoo) reaches 63.2, while the prompt actually used in the main experiments scores only 49.4. This is a large and critical gap, yet Figure 1, the paper’s most important result, uses the lower 49.4 baseline. The same issue appears for Qwen2.5-7B, whose baseline is also lower than that reported in SimpleRL-Z
The authors conducted RLVR experiments on multiple models and with different signals including Ground Truth, Majority Vote, One-Shot RL, Format Reward, Incorrect Label, and Random Reward, and the training process was comprehensively demonstrated.
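Several of the reward signals named in this review can be sketched as simple functions. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, the format check (presence of a `\boxed{` expression) and the random-reward probability of 0.5 are assumptions, and One-Shot RL is omitted because the source does not describe it.

```python
import random
from collections import Counter

def ground_truth_reward(answer, label):
    """Reward 1.0 when the extracted answer matches the correct label."""
    return 1.0 if answer == label else 0.0

def majority_vote_label(answers):
    """Pseudo-label: the most frequent answer among sampled responses."""
    return Counter(answers).most_common(1)[0][0]

def format_reward(response):
    """Reward the output format only; the \\boxed{ check is an assumption."""
    return 1.0 if "\\boxed{" in response else 0.0

def incorrect_label_reward(answer, wrong_label):
    """Reward agreement with a label known to be wrong."""
    return 1.0 if answer == wrong_label else 0.0

def random_reward(rng, p=0.5):
    """Reward drawn at random, independent of the response (p is assumed)."""
    return 1.0 if rng.random() < p else 0.0

if __name__ == "__main__":
    rng = random.Random(0)
    sampled = ["4", "5", "4"]  # hypothetical answers extracted from three rollouts
    pseudo = majority_vote_label(sampled)
    print([ground_truth_reward(a, "4") for a in sampled])
    print([ground_truth_reward(a, pseudo) for a in sampled])
    print([random_reward(rng) for _ in sampled])
```

Only the first function depends on the true answer; the others carry progressively less information about correctness, which is what makes them useful as controls.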
I greatly admire the authors' attempt to explain why spurious rewards are only effective on Qwen series models through reasoning behaviors learned during pre-training (such as code reasoning). However, I am not convinced by their experimental design, observations, and corresponding explanations. The phenomena and hypotheses feel quite superficial, and many observations and conclusions may be contradictory. 1. The authors propose in Section 4.2 that "Performance is correlated with code rea
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
