Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Mengjie Ren, Jie Lou, Boxi Cao, Xueru Wen, Hongyu Lin, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu

TL;DR
CIPO (Correction-Oriented Policy Optimization) extends reinforcement learning with verifiable rewards (RLVR) by converting the model's own failed on-policy trajectories into correction-oriented supervision, improving both reasoning and self-correction in language models.
Contribution
CIPO is a simple extension to RLVR that turns the model's own failed attempts into correction supervision, optimized jointly with the standard RLVR objective and without any external signals.
Findings
CIPO outperforms strong baselines on 11 benchmarks.
It significantly improves reasoning and correction performance.
CIPO enhances the model's intrinsic reasoning capacity.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11…
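The abstract does not specify how correction samples are built or how the two objectives are weighted, so the following is only a minimal sketch of the general idea under stated assumptions: a binary verifier scores each on-policy rollout, and every failed rollout is wrapped into a new prompt that asks the model to find and fix the error. The names `Rollout`, `verify`, `build_correction_prompt`, and `make_training_prompts`, the exact-match verifier, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float  # binary verifiable reward: 1.0 if correct, 0.0 otherwise


def verify(response: str, reference: str) -> float:
    """Toy verifier (assumption): exact match on the last line of the response."""
    stripped = response.strip()
    final = stripped.splitlines()[-1].strip() if stripped else ""
    return 1.0 if final == reference.strip() else 0.0


def build_correction_prompt(question: str, failed_response: str) -> str:
    """Wrap a failed attempt into a correction task (hypothetical template)."""
    return (
        f"Question:\n{question}\n\n"
        f"A previous attempt (judged incorrect):\n{failed_response}\n\n"
        "Identify the error and give a corrected solution."
    )


def make_training_prompts(
    question: str,
    responses: List[str],
    reference: str,
    max_corrections: int = 2,
) -> Tuple[List[Rollout], List[str]]:
    """Score rollouts and derive correction prompts from the failed ones.

    Returns the scored rollouts (for the standard RLVR update) and the
    correction prompts (to be rolled out and scored by the same verifier).
    """
    rollouts = [Rollout(question, r, verify(r, reference)) for r in responses]
    failed = [r for r in rollouts if r.reward == 0.0]
    correction_prompts = [
        build_correction_prompt(question, r.response)
        for r in failed[:max_corrections]
    ]
    return rollouts, correction_prompts
```

In this sketch the correction prompts would be sampled and rewarded with the same verifier, then optimized alongside the original rollouts; the relative weighting of the two terms is left open because the abstract does not state it.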
