Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

Anand Siththaranjan; Cassidy Laidlaw; Dylan Hadfield-Menell

arXiv:2312.08358·cs.LG·April 18, 2024·2 cites

TL;DR

The paper proves that standard preference learning in RLHF implicitly aggregates over hidden context via Borda count, which can produce counter-intuitive results that differ from expected-utility aggregation, and proposes distributional preference learning to better account for hidden context and improve robustness.

Contribution

It formalizes how standard preference learning methods implicitly use Borda count, identifies associated vulnerabilities, and introduces distributional preference learning to mitigate these issues.

Findings

01 Standard preference learning aggregates preferences via Borda count.

02 Distributional preference learning reduces vulnerabilities in RLHF.

03 Experimental results show improved robustness and detection of hidden context.

Abstract

In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference…
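As a rough illustration of the contrast the abstract draws between Borda count and expected utility, the sketch below compares the two aggregation rules on a toy problem. Everything in it is an assumption made for illustration: the two alternatives, the two hidden contexts (think of two annotator groups), their probabilities, the utility values, and the function names are hypothetical and do not come from the paper or its repository. Preferences are assumed noiseless within each context, and the normalized Borda score used here (the average probability of beating each alternative, with ties counted as one half) is one common convention rather than a quotation of the paper's definition.

```python
import numpy as np

def borda_count(utilities, context_probs):
    """Normalized Borda score per alternative.

    utilities[z, a] is the (hypothetical) utility of alternative a under
    hidden context z; context_probs[z] is the probability of context z.
    """
    utilities = np.asarray(utilities, dtype=float)
    context_probs = np.asarray(context_probs, dtype=float)
    # diff[z, a, b] = u(a, z) - u(b, z)
    diff = utilities[:, :, None] - utilities[:, None, :]
    # wins[z, a, b] = 1 if a beats b in context z, 0.5 on a tie, else 0
    wins = (diff > 0) + 0.5 * (diff == 0)
    # pairwise[a, b] = probability over hidden contexts that a is preferred to b
    pairwise = np.einsum("z,zab->ab", context_probs, wins)
    # Average over opponents (including a itself, which contributes 0.5)
    return pairwise.mean(axis=1)

def expected_utility(utilities, context_probs):
    """Expected utility per alternative, averaging over hidden contexts."""
    return np.asarray(context_probs, dtype=float) @ np.asarray(utilities, dtype=float)

# Toy data (assumed): rows are hidden contexts, columns are alternatives A and B.
# Context 0 (60% of annotators) slightly prefers A; context 1 (40%) strongly prefers B.
utilities = np.array([
    [1.0, 0.9],
    [0.0, 1.0],
])
context_probs = np.array([0.6, 0.4])

bc = borda_count(utilities, context_probs)
eu = expected_utility(utilities, context_probs)

names = ["A", "B"]
print("Borda winner:", names[int(np.argmax(bc))])
print("Expected-utility winner:", names[int(np.argmax(eu))])
```

In this toy setup the majority's mild preference decides the Borda ranking, while the minority's strong preference decides the expected-utility ranking, so the two rules pick different alternatives. Pairwise comparison data records only which option was preferred, not by how much, which is why the strength of the minority's preference is invisible to the Borda-style aggregate.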

Code & Models

Repositories

cassidylaidlaw/hidden-context
PyTorch · Official

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing