Reward Learning From Preference With Ties

Jinsong Liu; Dongdong Ge; Ruihao Zhu

arXiv:2410.05328·cs.LG·October 10, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces the Bradley-Terry model with ties (BTT) for preference learning in RLHF, demonstrating that accounting for ties improves the accuracy of preference strength measurement and enhances model fine-tuning performance.

Contribution

The paper proposes the generalized Bradley-Terry model with ties (BTT) for better preference modeling, addressing bias issues caused by ignoring ties in human preference data.

Findings

1. Incorporating ties reduces bias in preference strength estimation.
2. Fine-tuning with BTT outperforms traditional BT on synthetic datasets.
3. Accounting for ties improves reward modeling accuracy.

Abstract

Reward learning plays a pivotal role in Reinforcement Learning from Human Feedback (RLHF), ensuring the alignment of language models. The Bradley-Terry (BT) model stands as the prevalent choice for capturing human preferences from datasets containing pairs of chosen and rejected responses. In preference modeling, the focus is not on absolute values but rather on the reward difference between chosen and rejected responses, referred to as preference strength. Thus, precise evaluation of preference strength holds paramount importance in preference modeling. However, an easily overlooked factor significantly affecting preference strength measurement is that human attitudes towards two responses may not solely indicate a preference for one over the other and ties are also a common occurrence. To address this, we propose the adoption of the generalized Bradley-Terry model -- the Bradley-Terry…
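As a rough illustration of what "preference strength" and ties mean here, the sketch below computes outcome probabilities from a reward difference. It assumes the Rao-Kupper form of the Bradley-Terry model with ties, which this excerpt does not spell out; the function name `btt_probs` and the tie parameter `theta` are illustrative and not taken from the paper.

```python
import math


def btt_probs(r_first, r_second, theta=1.0):
    """Outcome probabilities for a pair of responses.

    Assumption: Rao-Kupper parameterization of Bradley-Terry with ties,
    with theta >= 1 controlling how likely a tie is. theta = 1 recovers
    the standard Bradley-Terry model (no ties).
    Returns (P(first preferred), P(second preferred), P(tie)).
    """
    if theta < 1.0:
        raise ValueError("theta must be >= 1")
    d = r_first - r_second  # preference strength: the reward difference
    p_first = 1.0 / (1.0 + theta * math.exp(-d))
    p_second = 1.0 / (1.0 + theta * math.exp(d))
    # Closed form of 1 - p_first - p_second, exactly zero when theta == 1.
    p_tie = (theta**2 - 1.0) / (
        (1.0 + theta * math.exp(-d)) * (1.0 + theta * math.exp(d))
    )
    return p_first, p_second, p_tie


if __name__ == "__main__":
    # Equal rewards: plain BT splits 50/50, while theta = 2 also puts mass on a tie.
    print(btt_probs(0.0, 0.0, theta=1.0))
    print(btt_probs(0.0, 0.0, theta=2.0))
```

Under this assumed form, a pair with zero reward difference is a coin flip in plain BT, whereas with `theta > 1` part of the probability mass moves to the tie outcome. That is the kind of distinction the abstract says is lost when ties are ignored.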

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1. This paper proposes the BTT model in RLHF for more reliable preference learning.
2. This paper provides theoretical results that help readers better understand the effectiveness of the proposed method.
3. This paper is well written and easy to understand.

Weaknesses

1. **Method:** This paper is based on a problem setting in which ties exist, but that setting requires relabeling the preference data for the debiasing method, which is not suitable for most existing public preference datasets. For this problem, I think the authors should focus on the noisy labels of the preference dataset, rather than relabeling the data and make some analysis.
2. **Theory:** The theoretical analysis is based on the ground truth label of ties, rather than the noisy label mo

Reviewer 02 · Rating 3 · Confidence 3

Strengths

It is interesting to note that ties might exist in current preference datasets, which might cause issues for current preference optimization approaches. The paper is easy to follow.

Weaknesses

I do not see from Table 1 why it is necessary to introduce an algorithm to deal with ties. As stated in Lines 56-58, the reward models are trained with the BT assumption. Table 1 shows that these reward models could not distinguish the ties in the datasets because the preference strength is 0. However, the chosen and rejected responses in Table 1 are both suitable as the preferred response. The BT reward models assign similar rewards to the chosen and rejected responses, which means the BT reward model could

Reviewer 03 · Rating 3 · Confidence 3

Strengths

- Provides theoretical insights on the importance of incorporating ties in preference modeling
- Proposes an algorithm to address model mismatch problems in conventional preference datasets that lack tie annotations
- Conducts comprehensive experiments to verify the proposed methodology

Weaknesses

Importance of the problem:
- While incorporating ties in preference modeling is conceptually important, questions arise about its practical significance. If ties occur infrequently in real annotations, as noted in lines 457-459 ("We observe that Anthropic's HH-RLHF dataset contains over 160k samples, with only a small portion labeled as ties"), the significance of this issue may be limited.
- Moreover, how does the significance of ties change with data scaling? Would the impact of ties dimin

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Experimental Behavioral Economics Studies
