CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Sijia Cui; Pengyu Cheng; Jiajun Song; Yongbo Gai; Guojun Zhang; Zhechao Yu; Jianhe Lin; Xiaoxi Jiang; Guanjun Jiang

arXiv:2603.10101 · cs.LG · March 12, 2026

Open Access

TL;DR

CLIPO introduces a contrastive learning approach into policy optimization for reinforcement learning with verifiable rewards, enhancing the robustness and generalization of large language models in reasoning tasks.

Contribution

CLIPO is described as the first method to incorporate contrastive learning into RLVR, improving reasoning consistency and reducing hallucinations in LLMs during policy training.

Findings

1. Consistently improves RLVR baseline performance.

2. Enhances reasoning robustness and reduces hallucinations.

3. Demonstrates superior generalization across benchmarks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR relies solely on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on rollouts whose process is wrong but whose outcome is correct can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and…
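The abstract does not spell out the form of the contrastive loss, so the following is only a minimal sketch of the general idea, assuming a generic supervised-contrastive (InfoNCE-style) objective rather than the paper's actual formulation. The function name `contrastive_loss`, the per-rollout embedding vectors, the binary `rewards` list, and the `temperature` value are all hypothetical choices made for illustration. Each successful rollout is treated as an anchor, the other successful rollouts in the group as positives, and every remaining rollout as part of the contrast set.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(embeddings, rewards, temperature=0.1):
    """Supervised-contrastive loss over a group of rollouts (illustrative only).

    embeddings: one vector per rollout (hypothetical representation).
    rewards:    1 for a verified-correct rollout, 0 otherwise.
    Returns 0.0 when fewer than two rollouts succeeded, since no positive
    pair exists in that case.
    """
    n = len(embeddings)
    successes = [i for i in range(n) if rewards[i] == 1]
    if len(successes) < 2:
        return 0.0

    total = 0.0
    for i in successes:
        # Temperature-scaled similarity of anchor i to every other rollout.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits.values())
        log_denom = peak + math.log(
            sum(math.exp(x - peak) for x in logits.values())
        )
        positives = [j for j in successes if j != i]
        # Average negative log-probability assigned to the positives.
        total -= sum(logits[p] - log_denom for p in positives) / len(positives)
    return total / len(successes)


# Two correct rollouts with similar embeddings, one incorrect rollout apart.
aligned = contrastive_loss([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]], [1, 1, 0])
# Two correct rollouts far apart, with the incorrect one close to the first.
scattered = contrastive_loss([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]], [1, 1, 0])
print(aligned < scattered)
```

Under these assumptions the loss is low when correct rollouts share a representation and high when a correct rollout sits closer to an incorrect one than to its fellow correct rollouts. In a training setup such a term would presumably be added to the policy-gradient objective with some weighting coefficient, but that combination is not described in the text available here.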

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)