Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF
Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell

TL;DR
This paper shows that preference learning in RLHF implicitly aggregates over hidden context via Borda count, which can produce counter-intuitive results, and proposes distributional preference learning to better account for hidden context and improve robustness.
Contribution
The paper formalizes how standard preference learning methods implicitly aggregate via Borda count, identifies the vulnerabilities this creates, and introduces distributional preference learning to mitigate them.
Findings
Standard preference learning implicitly aggregates over hidden context via Borda count.
Distributional preference learning, which estimates a distribution of possible scores for each alternative, reduces vulnerabilities in RLHF.
Experimental results show improved robustness and detection of hidden context.
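To make the idea of distributional preference learning concrete, here is a minimal sketch of one possible instantiation. It assumes each alternative's score is modeled as an independent Gaussian with a predicted mean and standard deviation; the function names and the Gaussian choice are illustrative assumptions, and the paper's exact parameterization may differ.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, computed with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def preference_prob(mu_a, sigma_a, mu_b, sigma_b):
    """P(score_a > score_b) for independent Gaussian scores (assumed model)."""
    return normal_cdf((mu_a - mu_b) / math.sqrt(sigma_a**2 + sigma_b**2))

def nll(mu_a, sigma_a, mu_b, sigma_b):
    """Negative log-likelihood of observing 'a preferred to b'."""
    return -math.log(preference_prob(mu_a, sigma_a, mu_b, sigma_b))

# Same gap in means, different spreads: a wide predicted distribution
# signals disagreement that a single point estimate would hide.
print(preference_prob(1.0, 0.1, 0.0, 0.1))  # confident preference
print(preference_prob(1.0, 5.0, 0.0, 5.0))  # close to a coin flip
```

In this sketch, a large predicted spread is what would flag an alternative as likely affected by hidden context.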
Abstract
In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference…
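The gap between Borda count and expected-utility aggregation can be seen in a small toy example. The sketch below assumes two hidden annotator types and three alternatives; all names and numbers are invented for illustration, and the Borda score is normalized by the number of other alternatives, which may differ from the paper's convention by a constant factor.

```python
# Toy illustration; every number below is made up.
ALTERNATIVES = ["a", "b", "c"]
TYPE_PROBS = [0.6, 0.4]  # P(z) for two hidden annotator types
UTILITIES = [
    [1.0, 0.8, 0.9],   # type 0 mildly prefers a > c > b
    [0.0, 10.0, 0.5],  # type 1 strongly prefers b > c > a
]

def expected_utility(j):
    """Average utility of alternative j over the hidden context."""
    return sum(p * u[j] for p, u in zip(TYPE_PROBS, UTILITIES))

def borda_score(j):
    """Average probability that alternative j beats another alternative."""
    others = [k for k in range(len(ALTERNATIVES)) if k != j]
    wins = sum(
        p
        for k in others
        for p, u in zip(TYPE_PROBS, UTILITIES)
        if u[j] > u[k]
    )
    return wins / len(others)

def winner(score_fn):
    """Name of the alternative with the highest score."""
    best = max(range(len(ALTERNATIVES)), key=score_fn)
    return ALTERNATIVES[best]

borda_winner = winner(borda_score)
eu_winner = winner(expected_utility)
print("Borda count picks:", borda_winner)
print("Expected utility picks:", eu_winner)
```

Here Borda count selects alternative a because the majority type ranks it first, while expected utility selects b because the minority type values it far more strongly. Borda count only sees which alternative wins each comparison, not by how much.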
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing
