ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

Jinpeng Wang; Chao Li; Ting Ye; Mengyuan Zhang; Wei Liu; Jian Luan

arXiv:2511.21005 · cs.AI · January 14, 2026

TL;DR

ICPO is a confidence-driven preference optimization method for reinforcement learning of large language models. It targets coarse-grained rewards, reward noise, and inefficient exploration, with the aim of stabilizing training and improving reasoning capabilities.

Contribution

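The paper proposes a preference advantage score, computed from the relative generation probabilities of multiple responses to the same prompt, to guide exploration and improve training stability in Reinforcement Learning with Verifiable Rewards (RLVR).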

Findings

01. ICPO outperforms existing methods on multiple benchmarks.

02. It effectively reduces reward noise and overconfidence errors.

03. ICPO enhances reasoning accuracy across diverse tasks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address these challenges, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The underlying intuition is that the probabilities with which an LLM generates different responses inherently and directly reflect its self-assessment of the reasoning process. Inspired by preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this…
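The idea of scoring each response by comparing its generation probability against the other responses sampled for the same prompt can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the Bradley-Terry-style sigmoid comparison, the `temperature` parameter, the function names `preference_scores` and `centered`, and the choice to give higher-probability responses higher scores. The exact formula, sign, and scaling used in the paper may differ, and the way the score is combined with the verifiable reward is not given in the text above, so it is not sketched.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function (avoids overflow for large |x|).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_scores(seq_logprobs, temperature=1.0):
    """Hypothetical pairwise preference score for each response in a group.

    seq_logprobs: log-probabilities the policy assigns to each sampled
    response for the same prompt (assumed to be comparable, e.g.
    length-normalized).
    Returns, for response i, the mean over j != i of
    sigmoid((logp_i - logp_j) / temperature).
    """
    n = len(seq_logprobs)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    scores = []
    for i, lp_i in enumerate(seq_logprobs):
        total = 0.0
        for j, lp_j in enumerate(seq_logprobs):
            if i != j:
                total += _sigmoid((lp_i - lp_j) / temperature)
        scores.append(total / (n - 1))
    return scores


def centered(scores):
    # Subtract the group mean so the scores sum to zero within the group,
    # in the spirit of a group-relative advantage.
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


if __name__ == "__main__":
    group_logprobs = [-1.2, -0.8, -2.5, -1.0]  # made-up values
    print(centered(preference_scores(group_logprobs)))
```

Because sigmoid(x) + sigmoid(-x) = 1, the raw scores in this sketch always average 0.5 within a group, so centering simply shifts them to sum to zero.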

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications