ScRPO: From Errors to Insights

Lianrui Li; Dakuan Lu; Jiawei Shao; Xuelong Li

arXiv:2511.06065 · cs.AI · January 6, 2026

TL;DR

ScRPO is a reinforcement learning framework that improves large language models' mathematical reasoning through iterative self-reflection and error correction, reaching average accuracies of 64.8% and 77.8% with 1.5B and 7B backbones across AIME, AMC, Olympiad, MATH-500, and GSM8k.

Contribution

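The paper introduces ScRPO, a two-phase self-correction reinforcement learning method: a trial-and-error stage trains the model with GRPO and collects its incorrect responses into an error pool, and a self-correction stage then guides the model to analyze and fix the reasoning flaws behind those errors.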

Findings

01

Achieves average accuracies of 64.8% and 77.8% on mathematical benchmarks with DeepSeek-R1-Distill-Qwen-1.5B and 7B backbones, respectively.

02

Outperforms vanilla baselines by 6.0% and 3.2%.

03

Outperforms strong post-training methods such as DAPO and GRPO.

Abstract

We introduce Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to empower large language models with advanced mathematical reasoning capabilities through iterative self-reflection and error correction. The ScRPO framework operates in two distinct phases: (1) Trial-and-error learning stage, where the model is trained via GRPO, and incorrect responses are collected to form an "error pool"; and (2) Self-correction learning stage, which guides the model to introspectively analyze and rectify the reasoning flaws behind its previous errors. Extensive evaluations across challenging mathematical benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, validate the efficacy of our approach. Using DeepSeek-R1-Distill-Qwen-1.5B and 7B as backbones, ScRPO achieves average accuracies of 64.8% and 77.8%, respectively. This represents a…
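The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ErrorRecord`, `group_relative_advantages`, `collect_errors`, and `build_correction_prompt` are hypothetical, the binary 0/1 reward is an assumption, the prompt wording is invented for illustration, and the advantage formula is the commonly used GRPO group normalization (reward minus group mean, divided by group standard deviation) rather than anything confirmed by this page.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class ErrorRecord:
    """One incorrect response kept for the later self-correction stage (hypothetical type)."""
    question: str
    wrong_response: str


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its sampled group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def collect_errors(question, responses, rewards, error_pool):
    """Stage 1 side effect: store responses that received zero reward.

    Assumption: reward is 1.0 for a correct final answer and 0.0 otherwise.
    """
    for response, reward in zip(responses, rewards):
        if reward == 0.0:
            error_pool.append(ErrorRecord(question, response))


def build_correction_prompt(record):
    """Stage 2 input: ask the model to reflect on and fix a past error.

    The wording is illustrative only; the paper's actual prompt is not shown here.
    """
    return (
        f"Question: {record.question}\n"
        f"Previous incorrect attempt: {record.wrong_response}\n"
        "Identify the flaw in the reasoning above, then give a corrected solution."
    )


if __name__ == "__main__":
    pool = []
    question = "What is 7 * 8?"
    responses = ["56", "54", "58", "56"]
    rewards = [1.0, 0.0, 0.0, 1.0]

    advantages = group_relative_advantages(rewards)
    collect_errors(question, responses, rewards, pool)

    print(advantages)
    print(build_correction_prompt(pool[0]))
```

In this sketch, correct responses in a group receive positive advantages and incorrect ones negative advantages, while the incorrect responses are also stored so that a later stage can turn each into a reflection prompt. How ScRPO actually rewards and optimizes the self-correction stage is not specified in the text above.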

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications