Reward Learning From Preference With Ties
Jinsong Liu, Dongdong Ge, Ruihao Zhu

TL;DR
This paper introduces the Bradley-Terry model with ties (BTT) for preference learning in RLHF, demonstrating that accounting for ties improves the accuracy of preference strength measurement and enhances model fine-tuning performance.
Contribution
The paper proposes the generalized Bradley-Terry model with ties (BTT) for better preference modeling, addressing bias issues caused by ignoring ties in human preference data.
Findings
Incorporating ties reduces bias in preference strength estimation.
Fine-tuning with BTT outperforms traditional BT on synthetic datasets.
Accounting for ties improves reward modeling accuracy.
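A small numerical sketch can make the bias claim concrete. It assumes ties follow the Rao-Kupper form of the Bradley-Terry model with ties; the tie parameter `theta` and the function name are illustrative and are not taken from the paper. For a single pair with true reward gap `d`, it computes the log-odds that a plain BT fit would recover in the large-sample limit if tied comparisons were simply discarded.

```python
import math

def bt_strength_if_ties_dropped(d, theta):
    """Limiting BT log-odds for one pair when ties are discarded.

    Assumes the data follow a Rao-Kupper model with true reward gap d
    and tie parameter theta >= 1 (theta = 1 means no ties occur).
    """
    p_win = 1.0 / (1.0 + theta * math.exp(-d))   # P(first response preferred)
    p_lose = 1.0 / (1.0 + theta * math.exp(d))   # P(second response preferred)
    # A BT fit on non-tied samples matches the conditional win rate,
    # so its implied preference strength is the log-odds of win vs. lose.
    return math.log(p_win / p_lose)

for d in (0.5, 1.0, 2.0):
    est = bt_strength_if_ties_dropped(d, theta=2.0)
    print(f"true gap {d:.1f} -> BT estimate without ties {est:.3f}")
```

Under these assumptions the recovered strength equals the true gap only when `theta = 1`; for `theta > 1` and a positive gap it is larger than `d`, which is one way dropping ties can distort preference strength.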
Abstract
Reward learning plays a pivotal role in Reinforcement Learning from Human Feedback (RLHF), ensuring the alignment of language models. The Bradley-Terry (BT) model stands as the prevalent choice for capturing human preferences from datasets containing pairs of chosen and rejected responses. In preference modeling, the focus is not on absolute values but rather on the reward difference between chosen and rejected responses, referred to as preference strength. Thus, precise evaluation of preference strength holds paramount importance in preference modeling. However, an easily overlooked factor significantly affecting preference strength measurement is that human attitudes towards two responses may not solely indicate a preference for one over the other and ties are also a common occurrence. To address this, we propose the adoption of the generalized Bradley-Terry model -- the Bradley-Terry…
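The difference between the two models can be sketched in a few lines. The BT probability below is the standard sigmoid of the reward difference. The model with ties is written in the Rao-Kupper form, which is one common generalization and is assumed here rather than confirmed by the truncated abstract; `theta`, `btt_probs`, and `btt_nll` are hypothetical names chosen for illustration.

```python
import math

def bt_prob(r_chosen, r_rejected):
    """Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def btt_probs(r_a, r_b, theta=2.0):
    """Rao-Kupper style model with ties (assumed form).

    Returns (P(a preferred), P(b preferred), P(tie)).
    theta >= 1 controls how likely ties are; theta = 1 recovers plain BT.
    """
    if theta < 1.0:
        raise ValueError("theta must be >= 1")
    d = r_a - r_b  # preference strength (reward difference)
    p_a = 1.0 / (1.0 + theta * math.exp(-d))
    p_b = 1.0 / (1.0 + theta * math.exp(d))
    p_tie = (theta**2 - 1.0) / (
        (1.0 + theta * math.exp(-d)) * (1.0 + theta * math.exp(d))
    )
    return p_a, p_b, p_tie

def btt_nll(r_a, r_b, label, theta=2.0):
    """Negative log-likelihood of one comparison; label is 'a', 'b', or 'tie'.

    Note: with theta = 1 the tie probability is 0, so a 'tie' label
    has no finite loss (math.log raises ValueError).
    """
    p_a, p_b, p_tie = btt_probs(r_a, r_b, theta)
    return -math.log({"a": p_a, "b": p_b, "tie": p_tie}[label])

# Equal rewards: BT says 50/50, the tie-aware model spreads mass over three outcomes.
print(bt_prob(0.0, 0.0))
print(btt_probs(0.0, 0.0, theta=2.0))
```

In a reward-model training loop, a loss like `btt_nll` would replace the usual BT log-sigmoid loss, with tied pairs contributing through the `'tie'` branch instead of being dropped.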
Peer Reviews
Decision·Submitted to ICLR 2025
1. This paper proposes the BTT model in RLHF for more reliable preference learning. 2. This paper provides theoretical results that help readers better understand the effectiveness of the proposed method. 3. This paper is well written and easy to understand.
1. **Method:** This paper is based on a problem setting in which ties exist, but that setting requires relabeling the preference data for the debiasing method, which is not suitable for most existing public preference datasets. For this problem, I think the authors should focus on the noisy labels in the preference dataset and provide some analysis, rather than relabeling the data. 2. **Theory:** The theoretical analysis is based on the ground truth label of ties, rather than the noisy label mo
It is interesting to note that ties might exist in current preference datasets, which might raise issues for current preference optimization approaches. The paper is easy to follow.
I do not see from Table 1 why an algorithm to deal with ties is necessary. As stated in Lines 56-58, the reward models are trained under the BT assumption. Table 1 shows that these reward models could not distinguish the ties in the datasets due to 0 preference strength. However, the chosen and rejected responses in Table 1 are both suitable as the preferred response. The BT reward models assign similar rewards to the chosen and rejected responses, which means the BT reward model could
- Provides theoretical insights on the importance of incorporating ties in preference modeling - Proposes an algorithm to address model mismatch problems in conventional preference datasets that lack tie annotations - Conducts comprehensive experiments to verify the proposed methodology
Importance of the problem: - While incorporating ties in preference modeling is conceptually important, questions arise about its practical significance. If ties occur infrequently in real annotations, as noted in lines 457-459 ("We observe that Anthropic's HH-RLHF dataset contains over 160k samples, with only a small portion labeled as ties"), the significance of this issue may be limited. - Moreover, how does the significance of ties change with data scaling? Would the impact of ties dimin
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Experimental Behavioral Economics Studies
