CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Ahmed Heakl; Abdelrahman M. Shaker; Youssef Mohamed; Rania Elbadry; Omar Fetouh; Fahad Shahbaz Khan; Salman Khan

arXiv:2605.19436 · cs.LG · May 20, 2026

TL;DR

CEPO introduces a contrastive self-distillation method for reinforcement learning with verifiable rewards, sharpening reasoning credit and improving accuracy on mathematical benchmarks.

Contribution

The paper proposes CEPO, a contrastive policy optimization technique that sharpens credit assignment by contrasting correct-answer and incorrect-answer signals at no additional sampling cost.

Findings

01. CEPO achieves higher accuracy (43.43% and 60.56%) on five mathematical reasoning benchmarks.

02. CEPO outperforms GRPO and other self-distillation methods at the same training budgets.

03. Distribution-matching self-distillation methods underperform untrained baselines.

Abstract

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning…
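The contrastive question posed in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function name `contrastive_evidence`, the `TokenEvidence` record, the choice to measure "favoring" as a log-probability difference against an unconditioned baseline, and the toy numbers are all assumptions made for illustration. The sketch takes per-token log-probabilities from three hypothetical passes over one sampled solution (no answer in context, correct answer in context, wrong answer in context) and flags tokens that the correct answer favors while the wrong answer disfavors.

```python
from typing import List, NamedTuple


class TokenEvidence(NamedTuple):
    token: str
    correct_shift: float  # log p(tok | ctx, correct answer) - log p(tok | ctx)
    wrong_shift: float    # log p(tok | ctx, wrong answer)   - log p(tok | ctx)
    contrast: float       # correct_shift - wrong_shift
    decisive: bool        # favored by the correct answer AND disfavored by the wrong one


def contrastive_evidence(
    tokens: List[str],
    logp_base: List[float],
    logp_correct: List[float],
    logp_wrong: List[float],
) -> List[TokenEvidence]:
    """Score each token by how differently the two answer-conditioned passes treat it.

    Assumption: all three lists hold log-probabilities of the same sampled tokens,
    so no extra rollouts are needed, only extra scoring passes.
    """
    if not (len(tokens) == len(logp_base) == len(logp_correct) == len(logp_wrong)):
        raise ValueError("all inputs must have one entry per token")

    scored = []
    for tok, base, corr, wrong in zip(tokens, logp_base, logp_correct, logp_wrong):
        correct_shift = corr - base
        wrong_shift = wrong - base
        scored.append(
            TokenEvidence(
                token=tok,
                correct_shift=correct_shift,
                wrong_shift=wrong_shift,
                contrast=correct_shift - wrong_shift,
                decisive=correct_shift > 0 and wrong_shift < 0,
            )
        )
    return scored


# Toy example with made-up numbers: only the final token separates the two answers.
example = contrastive_evidence(
    tokens=["so", "x", "=", "4"],
    logp_base=[-0.5, -1.0, -0.2, -2.0],
    logp_correct=[-0.5, -0.9, -0.2, -0.3],
    logp_wrong=[-0.5, -0.8, -0.2, -4.0],
)
for ev in example:
    print(f"{ev.token!r}: contrast={ev.contrast:+.2f} decisive={ev.decisive}")
```

In this toy input the token "x" becomes more likely under both answers, so a test that only asks whether the correct answer favors it would flag it, while the two-sided test does not. How CEPO turns such a signal into a per-token training weight is not stated in the text above and is left out of the sketch.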

Code & Models

Repositories

ahmedheakl/CEPO (GitHub)
