Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

Anamika Lochab; Bolian Li; Ruqi Zhang

arXiv:2605.00365 · cs.LG · May 4, 2026

TL;DR

This paper introduces Uniform-Correct Policy Optimization (UCPO), a method for reinforcement learning with verifiable rewards (RLVR) that improves diversity and multi-sample coverage by encouraging the policy to spread probability mass uniformly over correct solutions.

Contribution

The paper formalizes the mechanism behind diversity collapse in RLVR and proposes UCPO, an optimization method that improves diversity and multi-sample coverage without sacrificing accuracy.

Findings

1. UCPO improves Pass@K and diversity across multiple models and benchmarks.

2. UCPO achieves up to +10% on AIME24 Pass@64.

3. UCPO increases equation-level diversity by up to 45%.
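For readers unfamiliar with the Pass@K metric cited in these findings, the sketch below shows one common way to estimate it from n sampled completions of which c are verified correct. It uses the standard unbiased combinatorial estimator, 1 - C(n-c, k) / C(n, k). That the paper evaluates Pass@K this way is an assumption, and the function name `pass_at_k` is illustrative rather than taken from the UCPO repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K for one problem.

    n: total completions sampled for the problem
    c: number of those completions verified as correct
    k: attempt budget being evaluated (k <= n)

    Returns the probability that at least one of k completions drawn
    without replacement from the n samples is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # must contain at least one correct completion.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 64 samples, 3 of them correct.
single_attempt = pass_at_k(64, 3, 1)
full_budget = pass_at_k(64, 3, 64)
```

A benchmark-level score is then typically the mean of this per-problem value across all problems.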

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we…
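The indifference described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's formalization: the two policies, the solution labels, and the helper names are all hypothetical. It only shows that a binary verifiable reward depends on the total probability mass placed on correct solutions, so a uniform and a collapsed policy can earn the same expected reward while differing sharply in entropy over the correct set.

```python
import math


def expected_reward(policy: dict, correct: set) -> float:
    """Expected binary reward: total probability mass on correct solutions."""
    return sum(p for solution, p in policy.items() if solution in correct)


def entropy_over_correct(policy: dict, correct: set) -> float:
    """Shannon entropy (nats) of the policy conditioned on being correct."""
    mass = expected_reward(policy, correct)
    conditional = [policy[s] / mass for s in correct if policy.get(s, 0.0) > 0.0]
    return -sum(q * math.log(q) for q in conditional)


# Hypothetical problem with four distinct correct solutions and one wrong one.
correct = {"a", "b", "c", "d"}

# Both policies place 0.8 total mass on correct solutions.
uniform_correct = {"a": 0.2, "b": 0.2, "c": 0.2, "d": 0.2, "wrong": 0.2}
collapsed = {"a": 0.77, "b": 0.01, "c": 0.01, "d": 0.01, "wrong": 0.2}

reward_uniform = expected_reward(uniform_correct, correct)
reward_collapsed = expected_reward(collapsed, correct)

entropy_uniform = entropy_over_correct(uniform_correct, correct)
entropy_collapsed = entropy_over_correct(collapsed, correct)

print(f"expected reward: {reward_uniform:.3f} vs {reward_collapsed:.3f}")
print(f"entropy over correct set: {entropy_uniform:.3f} vs {entropy_collapsed:.3f}")
```

In this toy setting the reward signal alone cannot distinguish the two policies, which is the gap an objective favoring uniform mass over correct solutions is meant to close.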

Code & Models

Repository: AnamikaLochab/UCPO (GitHub)
