Influencing Humans to Conform to Preference Models for RLHF
Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Peter Stone

TL;DR
This paper explores ways to influence how humans express preferences so that they better conform to the preference model assumed by an RLHF algorithm, improving the quality of preference data without altering the human's underlying reward function.
Contribution
It introduces three interventions to improve human conformance to the preference model assumed in RLHF: showing humans the quantities that underlie a preference model, training people to follow a specific preference model, and modifying the preference elicitation question.
Findings
All interventions significantly affected human preference expression.
Interventions improved the alignment of human preferences with assumed models.
Practical tools were developed to enhance preference data quality for RLHF.
Abstract
Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human's unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally…
