CoRPO: Adding a Correctness Bias to GRPO Improves Generalization

Anisha Garg; Claire Zhang; Nishit Neema; David Bick; Ganesh Venkatesh; Joel Hestness

arXiv:2511.04439 · cs.AI · March 6, 2026

Open Access

TL;DR

CoRPO enhances GRPO by adding a correctness bias through baseline clipping, leading to improved out-of-domain reasoning and more robust, transferable problem-solving capabilities in large language models.

Contribution

The paper introduces CoRPO, a simple modification to GRPO that clips the baseline to a correctness threshold, reducing overfitting and improving generalization.

Findings

01. CoRPO outperforms GRPO on cross-domain reasoning tasks.

02. Models trained with CoRPO generalize better to out-of-domain tasks.

03. CoRPO demonstrates robustness and transferability across different problem domains.

Abstract

Group-Relative Policy Optimization (GRPO) has emerged as the standard for training reasoning capabilities in large language models through reinforcement learning. By estimating advantages using group-mean rewards rather than a learned critic, GRPO has enabled efficient scaling of reinforcement learning from verifiable rewards (RLVR). However, we identify a fundamental limitation: GRPO's mean baseline can assign positive advantages to incorrect solutions simply because they outperform a poorly performing group average. This leads to overestimation of advantages and reinforcement of incorrect behaviours. To address this, we propose Correctness-Relative Policy Optimization (CoRPO), a simple modification to the GRPO objective that clips the baseline from below at a fixed correctness threshold. We show that baseline clipping introduces a protective bias to advantage estimation that mitigates…
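
The failure mode described in the abstract can be illustrated with a small numeric sketch. This is not the authors' implementation: the function names, the reward scale of 0 to 1, and the threshold value of 0.5 are all assumptions made for illustration, and the standard-deviation normalization commonly used with GRPO is omitted for brevity. The sketch only assumes what the abstract states, namely that the group-mean baseline is prevented from falling below a fixed correctness threshold.

```python
from statistics import mean


def group_mean_advantages(rewards):
    """GRPO-style advantages: each reward minus the group-mean baseline.

    Simplification: the usual division by the group standard deviation is omitted.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def clipped_baseline_advantages(rewards, correctness_threshold):
    """Advantages with the baseline clipped from below at a fixed threshold.

    Assumed form: baseline = max(group mean, correctness_threshold).
    """
    baseline = max(mean(rewards), correctness_threshold)
    return [r - baseline for r in rewards]


# Hypothetical threshold: rewards below 0.5 are treated as incorrect.
THRESHOLD = 0.5

# Hypothetical group in which every sample is incorrect,
# but one sample earns some partial reward.
failing_group = [0.0, 0.0, 0.25, 0.0]

# Hypothetical group whose mean reward already exceeds the threshold.
passing_group = [1.0, 0.75, 1.0, 0.25]

# With the plain mean baseline, the 0.25 sample gets a positive advantage.
print(group_mean_advantages(failing_group))

# With the clipped baseline, no incorrect sample gets a positive advantage.
print(clipped_baseline_advantages(failing_group, THRESHOLD))

# When the group mean is above the threshold, clipping changes nothing.
print(clipped_baseline_advantages(passing_group, THRESHOLD))
```

In the failing group, the plain mean baseline rewards the least-bad incorrect sample, while the clipped baseline keeps every advantage at or below zero. In the passing group, the two estimators coincide because the group mean already exceeds the assumed threshold.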

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Advanced Bandit Algorithms Research