MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
Zhaokang Liao, Yingguo Gao, Yi Yang, Yongheng Hu, Jingting Ding

TL;DR
MCPO is a reinforcement learning algorithm for improving reasoning in large language models. It consolidates mastery on prompts the model already answers mostly or fully correctly, which improves training efficiency and performance on mathematical benchmarks.
Contribution
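MCPO introduces a mastery-consolidation mechanism with a hinge-KL regularizer and a prompt weighting scheme. Together they target two weaknesses of GRPO-style RLVR objectives: the vanishing training signal on mastered prompts and the shrinking weight on majority-correct prompts.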
Findings
MCPO improves pass@1 performance across benchmarks.
It enhances solution diversity by consolidating mastery.
It limits policy drift on mastered prompts and strengthens consolidation from partial correctness to mastery during training.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach to improve the reasoning abilities of Large Language Models (LLMs). Among RLVR algorithms, Group Relative Policy Optimization (GRPO) and its variants have demonstrated strong performance and high training efficiency. However, GRPO-style objectives exhibit two issues on high-accuracy prompts, namely mastered prompts (rollout accuracy = 1) and majority-correct prompts (rollout accuracy in (0.5, 1)). For mastered prompts, group-relative advantages vanish, yielding no training signal and unconstrained policy drift that can cause forgetting. For majority-correct prompts, the induced query weight shrinks as accuracy increases, weakening consolidation from partial correctness to mastery. To alleviate this, we propose Mastery-Consolidated Policy Optimization (MCPO), which introduces (i) a hinge-KL…
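The two issues named in the abstract can be illustrated with a small numerical sketch. It assumes binary rewards (1 for a correct rollout, 0 otherwise) and the usual GRPO-style standardization of rewards within one prompt's rollout group. The function names, the group size of 8, and the use of mean absolute advantage as a stand-in for the "induced query weight" are assumptions made for illustration; the paper's exact definitions are not given in this excerpt.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group (GRPO-style).

    eps is a hypothetical small constant that avoids division by zero.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def mean_abs_advantage(rewards):
    """Illustrative proxy for how strongly a prompt contributes to the update."""
    adv = group_relative_advantages(rewards)
    return sum(abs(a) for a in adv) / len(adv)

# Three hypothetical rollout groups of size 8 with binary rewards.
mastered = [1.0] * 8                  # rollout accuracy = 1
half = [1.0] * 4 + [0.0] * 4          # rollout accuracy = 0.5
majority = [1.0] * 7 + [0.0]          # rollout accuracy = 0.875

# Mastered prompt: every reward equals the group mean, so all advantages are 0.
print(group_relative_advantages(mastered))

# Majority-correct prompt: the proxy weight is smaller than at accuracy 0.5.
print(mean_abs_advantage(half), mean_abs_advantage(majority))
```

Under these assumptions, the proxy weight for a group with accuracy p is approximately 2·sqrt(p(1 − p)), which peaks at p = 0.5 and falls to zero as p approaches 1. This matches the abstract's description of a weight that shrinks as accuracy increases.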
