Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

Yunqiao Yang; Houxing Ren; Zimu Lu; Ke Wang; Weikang Shi; Aojun Zhou; Junting Pan; Mingjie Zhan; Hongsheng Li

arXiv:2505.23540 · cs.CL · May 30, 2025

TL;DR

This paper introduces PCPO, a new preference optimization framework for large language models that considers both answer correctness and internal token probability consistency, leading to improved reasoning performance.

Contribution

The paper proposes a novel dual-metric framework, PCPO, which incorporates internal probability consistency into preference optimization for LLMs, enhancing reasoning capabilities.

Findings

1. PCPO outperforms outcome-only methods across various benchmarks.
2. Incorporating token-level probability consistency improves LLM reasoning.
3. Extensive experiments validate the effectiveness of PCPO.

Abstract

Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
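
The dual-criterion selection described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Response` class, the `select_pair` helper, and especially `toy_consistency`, which is a hypothetical stand-in score and not the consistency metric defined in the paper (the abstract does not give its formula). The sketch only shows the general shape: filter candidates by answer correctness first, then rank (correct, incorrect) pairs by a token-level probability score.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Optional, Tuple


@dataclass
class Response:
    text: str
    is_correct: bool          # outcome criterion: final answer matches the reference
    token_probs: List[float]  # per-token probabilities assigned by the model


def toy_consistency(a: Response, b: Response) -> float:
    """Hypothetical stand-in score, NOT the paper's metric.

    Compares token probabilities position by position over the shared
    length and returns a value in [0, 1]; higher means more similar.
    """
    n = min(len(a.token_probs), len(b.token_probs))
    if n == 0:
        return 0.0
    # zip stops at the shorter sequence, so exactly n terms are summed.
    diff = sum(abs(p - q) for p, q in zip(a.token_probs, b.token_probs)) / n
    return 1.0 - diff


def select_pair(
    responses: List[Response],
    score: Callable[[Response, Response], float] = toy_consistency,
) -> Optional[Tuple[Response, Response]]:
    """Return one (chosen, rejected) pair, or None if no valid pair exists.

    Criterion 1: chosen must be correct and rejected must be incorrect.
    Criterion 2: among such pairs, keep the one with the highest score.
    """
    correct = [r for r in responses if r.is_correct]
    wrong = [r for r in responses if not r.is_correct]
    if not correct or not wrong:
        return None
    return max(product(correct, wrong), key=lambda pair: score(*pair))


if __name__ == "__main__":
    samples = [
        Response("A", True, [0.9, 0.8]),
        Response("B", True, [0.5, 0.5]),
        Response("C", False, [0.9, 0.7]),
        Response("D", False, [0.1, 0.2]),
    ]
    chosen, rejected = select_pair(samples)
    print(chosen.text, rejected.text)
```

The scoring function is passed in as a parameter so that the placeholder can be swapped for the actual metric from the official repository; the selected pair would then feed a standard pairwise preference-optimization objective.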

Code & Models

Repositories

yunqiaoyang/pcpo (PyTorch, official)

Taxonomy

Topics: Constraint Satisfaction and Optimization · Topic Modeling · Natural Language Processing Techniques