Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
Feng Zhang, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang, Guanjun Jiang

TL;DR
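This paper introduces ConSPO, a contrastive sequence-level policy optimization framework for reinforcement learning with verifiable rewards (RLVR). It targets two structural limitations of GRPO, likelihood-misaligned scoring and score-insensitive credit assignment, to improve reasoning performance in language models.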
Contribution
ConSPO reformulates RLVR scoring to align with likelihoods and employs a contrastive, curriculum-guided approach for better credit assignment and model training.
Findings
ConSPO outperforms strong RLVR baselines on mathematical reasoning benchmarks.
Recasting GRPO as a weighted positive-negative score difference exposes two structural limitations: likelihood-misaligned scoring and score-insensitive credit assignment.
Extensive evaluations demonstrate consistent improvements across models and datasets.
Abstract
RLVR has become a widely adopted paradigm for improving LLMs' reasoning capabilities, and GRPO is one of its most representative algorithms. In this paper, we first show that GRPO admits an equivalent discriminative reformulation as a weighted positive-negative score difference. Under this view, GRPO increases sequence-level scores of verified positive rollouts and decreases those of negative rollouts, where the scores are averages of clipped token-level importance sampling ratios. This reformulation reveals two structural limitations of GRPO: likelihood-misaligned scoring, where clipped ratio-based surrogate scores are optimized instead of generation likelihoods, and score-insensitive credit assignment, where rollout-level credit is assigned without accounting for relative score gaps between positive and negative rollouts in the same group. To address these limitations, we propose…
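The reformulation described in the abstract can be checked numerically. The sketch below is not the paper's code. It assumes the standard GRPO surrogate with binary rewards, group-normalized advantages using the population standard deviation, a symmetric clip range `eps`, and no KL term. Under those assumptions the weight on the positive-negative difference works out to sqrt(p(1 - p)), where p is the fraction of positive rollouts in the group; the paper's exact weighting may differ. The function names and the toy ratios are hypothetical.

```python
import math


def clipped_score(ratios, positive, eps=0.2):
    """Sequence-level score: mean of one-sided clipped token-level ratios.

    For positive rollouts the PPO-style min() caps each ratio at 1 + eps;
    for negative rollouts it floors each ratio at 1 - eps.
    """
    if positive:
        clipped = [min(r, 1.0 + eps) for r in ratios]
    else:
        clipped = [max(r, 1.0 - eps) for r in ratios]
    return sum(clipped) / len(clipped)


def grpo_objective(rollouts, rewards, eps=0.2):
    """Standard GRPO surrogate (assumed form, no KL term)."""
    group_size = len(rewards)
    mean = sum(rewards) / group_size
    # Population standard deviation; requires mixed rewards so std > 0.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / group_size)
    total = 0.0
    for ratios, reward in zip(rollouts, rewards):
        adv = (reward - mean) / std
        per_token = [
            min(x * adv, max(min(x, 1.0 + eps), 1.0 - eps) * adv)
            for x in ratios
        ]
        total += sum(per_token) / len(per_token)
    return total / group_size


def contrastive_form(rollouts, rewards, eps=0.2):
    """Same objective written as a weighted positive-negative score gap."""
    pos = [clipped_score(x, True, eps) for x, r in zip(rollouts, rewards) if r == 1]
    neg = [clipped_score(x, False, eps) for x, r in zip(rollouts, rewards) if r == 0]
    p = len(pos) / len(rewards)
    weight = math.sqrt(p * (1.0 - p))  # derived under the assumptions above
    return weight * (sum(pos) / len(pos) - sum(neg) / len(neg))


# Toy group: token-level importance ratios per rollout, binary verifier rewards.
toy_rollouts = [[1.5, 0.9, 1.1], [0.5, 1.0], [1.3, 0.7, 1.0, 1.25], [0.95]]
toy_rewards = [1, 0, 1, 0]
gap = grpo_objective(toy_rollouts, toy_rewards) - contrastive_form(toy_rollouts, toy_rewards)
print(f"difference between the two forms: {gap:.2e}")
```

In this form the weight depends only on the group's pass rate, not on how far apart the positive and negative scores are, which is one way to read the "score-insensitive credit assignment" limitation named in the abstract.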
