Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

Mengjie Ren; Jie Lou; Boxi Cao; Xueru Wen; Hongyu Lin; Xianpei Han; Le Sun; Xing Yu; Yaojie Lu

arXiv:2605.14539·cs.CL·May 15, 2026

TL;DR

This paper introduces Correction-Oriented Policy Optimization (CIPO), an extension to Reinforcement Learning with Verifiable Rewards (RLVR) that converts the model's own on-policy failed trajectories into correction-oriented supervision, without external signals, improving both reasoning and self-correction in language models.

Contribution

CIPO is a simple extension to RLVR that uses failed attempts for correction supervision, leading to better learning and reasoning in language models.

Findings

01. CIPO outperforms strong baselines on 11 benchmarks.

02. It significantly improves reasoning and correction performance.

03. CIPO enhances the model's intrinsic reasoning capacity.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11…
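The idea described in the abstract (turning on-policy failures into correction samples and optimizing them jointly with the RLVR objective) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `verify` exact-match checker, the correction prompt template, the group-normalized advantage (a GRPO-style choice), and the `weight` parameter of `joint_objective` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float  # binary verifiable reward: 1.0 if verified correct, else 0.0


def verify(response: str, reference: str) -> float:
    """Hypothetical verifier: exact match between the last line and the reference."""
    lines = response.strip().splitlines()
    return 1.0 if lines and lines[-1].strip() == reference.strip() else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages (assumed GRPO-style, not confirmed by the source)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def build_correction_prompts(rollouts: List[Rollout]) -> List[str]:
    """Turn each failed rollout into a correction prompt (hypothetical template)."""
    return [
        f"{r.prompt}\n\nPrevious attempt (incorrect):\n{r.response}\n\n"
        "Identify the error and give a corrected solution."
        for r in rollouts
        if r.reward == 0.0
    ]


def joint_objective(rlvr_loss: float, correction_loss: float, weight: float = 1.0) -> float:
    """Combine both losses; the additive form and `weight` are assumptions."""
    return rlvr_loss + weight * correction_loss
```

Under this particular normalization, a group in which every rollout fails receives all-zero advantages, which is one concrete way the information in failed trajectories goes unused. Re-sampling from the prompts returned by `build_correction_prompts` and scoring the new responses with the same verifier would give those failures a second source of reward signal without any external supervision.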
