MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

Zhaokang Liao; Yingguo Gao; Yi Yang; Yongheng Hu; Jingting Ding

arXiv:2604.16972·cs.AI·April 21, 2026

TL;DR

MCPO is a reinforcement learning algorithm for improving reasoning in large language models. It consolidates mastery on high-accuracy prompts and improves training efficiency, leading to better performance on mathematical benchmarks.

Contribution

The paper introduces a mastery-consolidation mechanism with a hinge-KL regularizer and a prompt weighting scheme to address issues that existing RLVR algorithms exhibit on high-accuracy prompts.

Findings

01. MCPO improves pass@1 performance across benchmarks.

02. It enhances solution diversity by consolidating mastery.

03. It effectively manages policy drift and partial correctness in training.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach to improve the reasoning abilities of Large Language Models (LLMs). Among RLVR algorithms, Group Relative Policy Optimization (GRPO) and its variants have demonstrated strong performance and high training efficiency. However, GRPO-style objectives exhibit two issues on high-accuracy prompts, which include mastered prompts (rollout accuracy = 1) and majority-correct prompts (rollout accuracy in (0.5, 1)). For mastered prompts, group-relative advantages vanish, yielding no training signal and unconstrained policy drift that can cause forgetting. For majority-correct prompts, the induced query weight shrinks as accuracy increases, weakening consolidation from partial correctness to mastery. To alleviate these issues, we propose Mastery-Consolidated Policy Optimization (MCPO), which introduces (i) a hinge-KL…
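The two issues described in the abstract can be illustrated with a minimal sketch. It assumes binary (0/1) verifiable rewards and the standard GRPO advantage, which normalizes each reward by the group mean and standard deviation. The function names, the `eps` term, the group size of 8, and the use of mean absolute advantage as a stand-in for the "induced query weight" are illustrative assumptions, not definitions taken from the paper.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def mean_abs_advantage(num_correct, group_size):
    """Assumed proxy for how much training signal a prompt contributes."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return fmean(abs(a) for a in grpo_advantages(rewards))


# Mastered prompt: every rollout is correct, so every advantage is zero.
mastered = grpo_advantages([1.0] * 8)

# Majority-correct prompts: the signal shrinks as accuracy rises from 0.5 to 1.
signal_by_num_correct = {k: mean_abs_advantage(k, 8) for k in range(4, 9)}

if __name__ == "__main__":
    print("mastered advantages:", mastered)
    for k, w in signal_by_num_correct.items():
        print(f"accuracy {k / 8:.3f}: mean |advantage| = {w:.3f}")
```

Under these assumptions, the mean absolute advantage for a prompt with rollout accuracy p works out to 2·sqrt(p(1 − p)), which peaks at p = 0.5 and falls to zero at p = 1. This matches the abstract's description of a vanishing signal on mastered prompts and a shrinking weight on majority-correct ones.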
