TL;DR
CEPO introduces a contrastive self-distillation method for reinforcement learning with verifiable rewards, sharpening reasoning credit and improving accuracy on mathematical benchmarks.
Contribution
The paper proposes CEPO, a contrastive policy optimization technique that sharpens token-level credit assignment by contrasting correct-answer and incorrect-answer signals, without additional sampling cost.
Findings
CEPO achieves higher accuracy (43.43% and 60.56%) on five mathematical reasoning benchmarks.
CEPO outperforms GRPO and other self-distillation methods at the same training budgets.
Distribution-matching self-distillation methods underperform untrained baselines.
Abstract
When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning…
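The two-part question in the abstract can be illustrated with a small sketch. This is not the paper's objective: the abstract is truncated before the actual formulation, so the function name `contrastive_evidence`, the use of per-token log-probabilities under three conditions (no answer, correct answer, wrong answer), and the score `logp_correct - logp_wrong` are all assumptions made here for illustration only.

```python
def contrastive_evidence(logp_base, logp_correct, logp_wrong):
    """Illustrative per-token contrast (hypothetical, not the paper's formula).

    Each argument is a list of log-probabilities the model assigns to the
    same sampled tokens under a different conditioning:
      logp_base    -- no answer in context
      logp_correct -- correct answer in context
      logp_wrong   -- a wrong answer in context
    Returns a list of (is_evidence, score) pairs, one per token.
    """
    results = []
    for b, c, w in zip(logp_base, logp_correct, logp_wrong):
        favored_by_correct = c > b   # "does the correct answer favor this token?"
        disfavored_by_wrong = w < b  # "...while the wrong answer disfavors it?"
        score = c - w                # assumed contrast score, for illustration
        results.append((favored_by_correct and disfavored_by_wrong, score))
    return results


# Toy log-probabilities for three tokens (made-up numbers).
base = [-1.0, -2.0, -0.5]
correct = [-0.2, -1.9, -0.5]
wrong = [-3.0, -1.8, -0.5]
flags = contrastive_evidence(base, correct, wrong)
```

In this toy input only the first token passes both tests. The second is nudged up by both answers, and the third is unaffected by either, which is the filler-like behavior a single-sided comparison against the baseline would not separate as cleanly.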
