ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization

Hee Suk Yoon; Eunseop Yoon; Mark Hasegawa-Johnson; Sungwoong Kim; and Chang D. Yoo

arXiv:2506.08712·cs.CL·June 13, 2025

TL;DR

ConfPO is a lightweight preference-optimization method for large language models that needs no auxiliary model: it concentrates optimization on preference-critical tokens identified from the policy's own confidence, and it outperforms prior methods that adjust all token probabilities uniformly.

Contribution

The paper introduces ConfPO, a token-selection approach based solely on the training policy's confidence, which improves alignment quality without auxiliary models or additional compute.

Findings

01  ConfPO outperforms uniform Direct Alignment Algorithms (DAAs) on benchmark tasks.

02  It achieves better alignment at zero additional computational cost.

03  ConfPO mitigates overoptimization (i.e., reward hacking) by using the KL divergence budget more efficiently.

Abstract

We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy's confidence, without requiring any auxiliary models or compute. Unlike prior Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO), which uniformly adjust all token probabilities regardless of their relevance to preference, ConfPO focuses optimization on the most impactful tokens. This targeted approach improves alignment quality while mitigating overoptimization (i.e., reward hacking) by using the KL divergence budget more efficiently. In contrast to recent token-level methods that rely on credit-assignment models or AI annotators, raising concerns about scalability and reliability, ConfPO is simple, lightweight, and model-free. Experimental results on challenging alignment…
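The abstract does not spell out the selection rule, so the following is only a minimal sketch of what confidence-based token selection could look like. It assumes, purely for illustration, that a token counts as "critical" when the policy's probability for it falls below the mean token probability of its sequence, and that the selected tokens feed a reference-free logistic preference loss. The function names (`select_low_confidence`, `masked_avg_logp`, `preference_loss`), the threshold rule, the loss form, and the `beta` default are all hypothetical and are not taken from the paper or its repository.

```python
import math


def select_low_confidence(token_logps):
    """Return a 0/1 mask over tokens.

    ASSUMPTION (illustrative only): a token is selected when its policy
    probability is below the mean token probability of the sequence.
    """
    probs = [math.exp(lp) for lp in token_logps]
    threshold = sum(probs) / len(probs)
    return [1 if p < threshold else 0 for p in probs]


def masked_avg_logp(token_logps, mask):
    """Average log-probability over the selected tokens only."""
    n_selected = sum(mask)
    if n_selected == 0:
        # Nothing selected: fall back to averaging over all tokens.
        return sum(token_logps) / len(token_logps)
    return sum(lp * m for lp, m in zip(token_logps, mask)) / n_selected


def preference_loss(chosen_logps, rejected_logps, beta=2.0, selective=True):
    """Logistic preference loss on (optionally token-selected) log-probs.

    ASSUMPTION: a reference-free, length-averaged loss form is used here
    for brevity; selective=False recovers uniform treatment of all tokens.
    """
    if selective:
        chosen_mask = select_low_confidence(chosen_logps)
        rejected_mask = select_low_confidence(rejected_logps)
    else:
        chosen_mask = [1] * len(chosen_logps)
        rejected_mask = [1] * len(rejected_logps)

    margin = beta * (
        masked_avg_logp(chosen_logps, chosen_mask)
        - masked_avg_logp(rejected_logps, rejected_mask)
    )
    # Numerically stable -log(sigmoid(margin)) = softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Calling `preference_loss(chosen, rejected, selective=False)` gives the uniform baseline on the same inputs, which makes it easy to compare how much of the loss signal comes from the low-confidence tokens alone. The per-token log-probabilities are already produced by the forward pass of the policy, so a selection rule of this kind needs no extra model.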

Code & Models

Repositories

hee-suk-yoon/confpo
PyTorch · Official

Taxonomy

Topics: Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques