Debiasing Online Preference Learning via Preference Feature Preservation

Dongyoung Kim; Jinsung Yoon; Jinwoo Shin; Jaehyung Kim

arXiv:2506.11098·cs.LG·June 16, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces PFP, a framework that preserves human preference features during online learning to reduce bias and improve alignment of large language models.

Contribution

The paper proposes a novel Preference Feature Preservation framework that maintains human preference feature distributions during online learning, enhancing bias mitigation and model alignment.

Findings

01. PFP effectively reduces bias in preference features during online learning.

02. PFP outperforms previous methods on standard LLM alignment benchmarks.

03. The framework improves the handling of human preferences in LLM responses.

Abstract

Recent preference learning frameworks for large language models (LLMs) simplify human preferences into binary pairwise comparisons and scalar rewards. This simplification could bias LLMs' responses toward the most commonly preferred features, and the bias would be exacerbated over the iterations of online preference learning. To address these challenges, we propose a novel framework coined PFP (Preference Feature Preservation). The key idea of PFP is to maintain the distribution of human preference features and to use such rich signals throughout the online preference learning process. Specifically, PFP first extracts preference features from offline pairwise human preference data and trains a feature classifier. Then, using the trained classifier and a distribution-preserving optimization, PFP maps appropriate preference features to each new input instruction during online learning. Lastly, PFP…
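The middle step of the abstract (assigning preference features to new instructions while keeping their overall distribution fixed) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which optimization PFP uses, so the code below swaps in plain iterative proportional fitting (a Sinkhorn-style rescaling) as one possible way to do it. The function names `rebalance` and `assign_features`, the example probabilities, and the target distribution are all hypothetical.

```python
def rebalance(probs, target, iters=200):
    """Rescale classifier outputs so the average over instructions matches `target`.

    probs:  one row per instruction, one strictly positive probability per feature.
    target: desired share of each feature (assumed to come from the offline data).
    Illustrative stand-in only; the paper's actual optimization may differ.
    """
    n, k = len(probs), len(target)
    q = [list(row) for row in probs]
    for _ in range(iters):
        # Column step: scale each feature so its mean over instructions hits the target.
        col_mean = [sum(row[j] for row in q) / n for j in range(k)]
        q = [[row[j] * target[j] / col_mean[j] for j in range(k)] for row in q]
        # Row step: renormalize so each instruction's scores sum to one again.
        q = [[v / sum(row) for v in row] for row in q]
    return q


def assign_features(q):
    """Pick the highest-scoring feature index for each instruction."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


# Hypothetical classifier outputs that all lean toward feature 0.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
target = [0.5, 0.5]  # hypothetical feature distribution from offline data

naive_labels = assign_features(probs)
balanced_labels = assign_features(rebalance(probs, target))
```

With a plain argmax, every instruction in this toy batch receives the majority feature, which is the kind of drift the abstract warns about. After rebalancing, the assignment is spread according to the target shares instead.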

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

I like the number of ablations that are being performed, and the comparison with other length-controlled generation-based baselines. The experimental setup seems to be complete.

Weaknesses

The following questions could help improve this paper.
- The notation can be improved substantially, in particular for the FE part. I think one should use vector notation for the label space and a simplex to denote the output of the FE network.
- I am still not fully sure about the motivation of this work. Can the authors highlight cases where certain reward models prefer lengthy responses despite having incorrect answers? In my belief, as long as the answer is correct, and if the system promp

Reviewer 02 · Rating 5 · Confidence 3

Strengths

The paper introduces a unique approach, Preference Feature Preservation (PFP), for managing bias in preference learning. By explicitly incorporating preference features in the system prompts and maintaining feature distribution, it provides a fresh angle on bias mitigation that has not been explored in existing work.

Weaknesses

The paper introduces a set of predefined preference features, categorizing them into five distinct classes, which provides a structured framework for evaluating human preferences in various dimensions. However, in the main results, the experiments appear to primarily focus on addressing the length bias issue, leaving it unclear whether similar attention was given to the other identified preference classes. Were any experiments conducted to examine these additional preference aspects? Furthermor

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Originality: The introduction of the PFP framework is a novel approach to addressing bias in online preference learning for large language models (LLMs). This approach nearly resolves the length bias issue, which has been a long-standing problem in online preference learning.
2. Rigorous Experimental Design: The paper presents a well-structured set of experiments that validate the effectiveness of the PFP framework. The use of established benchmarks like AlpacaEval 2.0 and MT-Bench adds t

Weaknesses

1. Diversity of tasks: The paper primarily uses AlpacaEval 2.0 and MT-Bench for evaluation. While these are established benchmarks, additional or more diverse datasets could strengthen the claims about the framework's effectiveness. For example, the preference features of math or coding tasks may be different; the authors should give more insight into a variety of tasks.
2. Comparative Analysis with State-of-the-Art Methods: The paper compares PFP with SFT, DPO and Iterative DPO but does not in

Taxonomy

Topics: Text and Document Classification Technologies · Data Management and Algorithms · Web Data Mining and Analysis