Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards
Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, Jun Zhou

TL;DR
This paper investigates how positive and negative samples influence reinforcement learning with verifiable rewards in large reasoning models, proposing an adaptive advantage shaping method to improve training and reasoning capabilities.
Contribution
The paper provides a systematic analysis of sample polarity in RLVR and introduces A3PO, a token-level advantage shaping method for training reasoning models.
Findings
Positive samples sharpen correct reasoning patterns.
Negative samples promote exploration of new reasoning paths.
A3PO improves performance across five reasoning benchmarks.
Abstract
Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable rewards (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates…
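To make the idea of adjusting advantage values by sample polarity concrete, here is a minimal sketch. It is not the A3PO method, whose details are not given in the abstract. It assumes binary verifiable rewards and a GRPO-style group normalization, neither of which the abstract specifies, and the names `group_advantages`, `scale_by_polarity`, `token_level_advantages`, `pos_scale`, `neg_scale`, and `token_weights` are hypothetical, introduced only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for the rollouts of one prompt.

    Assumption: GRPO-style normalization (reward minus group mean,
    divided by group standard deviation). The source does not state
    which base algorithm is used.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scale_by_polarity(advantages, pos_scale=1.0, neg_scale=1.0):
    """Sample-level shaping: rescale positive and negative samples separately.

    Polarity is read off the sign of the advantage. With binary rewards
    and group normalization, correct rollouts get positive advantages.
    pos_scale and neg_scale are hypothetical knobs, not values from the paper.
    """
    return [a * (pos_scale if a > 0 else neg_scale) for a in advantages]


def token_level_advantages(advantage, token_weights):
    """Token-level shaping: spread one sample's advantage over its tokens.

    token_weights is a hypothetical per-token multiplier; how such weights
    would be chosen is exactly what a token-level method has to define.
    """
    return [advantage * w for w in token_weights]


# Four rollouts for one prompt: two verified correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)

# Damp positive samples while keeping negative samples at full strength.
shaped = scale_by_polarity(adv, pos_scale=0.5, neg_scale=1.0)

# Give the second token of the first rollout half the weight of the others.
per_token = token_level_advantages(shaped[0], [1.0, 0.5, 1.0])
```

In this sketch, changing `pos_scale` or `neg_scale` alters how strongly correct and incorrect rollouts push the policy update, which is the sample-level adjustment the abstract mentions, while `token_weights` stands in for the finer token-level adjustment.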
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification
