Influencing Humans to Conform to Preference Models for RLHF

Stephane Hatgis-Kessell; W. Bradley Knox; Serena Booth; Peter Stone

arXiv:2501.06416·cs.LG·April 14, 2026

TL;DR

This paper studies whether humans can be influenced to express preferences that more closely match the preference model an RLHF algorithm assumes, improving the quality of preference data without altering the human's underlying reward function.

Contribution

The paper introduces three interventions to improve human conformance to the preference model assumed in RLHF: showing humans the quantities that underlie a preference model, training people to follow a specific preference model, and modifying the preference elicitation question.

Findings

01. All interventions significantly affected human preference expression.

02. Interventions improved the alignment of human preferences with assumed models.

03. Practical tools were developed to enhance preference data quality for RLHF.

Abstract

Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human's unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally…
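
To make the notion of an assumed preference model concrete, here is a minimal sketch of one common choice in the RLHF literature: a logistic (Bradley-Terry style) model over the summed reward of two trajectory segments. The excerpt above does not say which preference model the paper uses, so this model, the function names, and the data layout are all illustrative assumptions rather than the authors' method.

```python
import math


def partial_return(rewards):
    """Sum of per-step rewards along one trajectory segment."""
    return sum(rewards)


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    Assumption (not from the source): a logistic model on the
    difference of the two segments' summed rewards.
    """
    diff = partial_return(rewards_a) - partial_return(rewards_b)
    # Numerically stable sigmoid for either sign of diff.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


def negative_log_likelihood(dataset, eps=1e-12):
    """Average negative log-likelihood of observed preference labels.

    dataset: list of (rewards_a, rewards_b, label) tuples, where
    label is 1 if the human preferred A and 0 if they preferred B.
    eps clamps probabilities away from 0 and 1 to avoid log(0).
    """
    total = 0.0
    for rewards_a, rewards_b, label in dataset:
        p = preference_probability(rewards_a, rewards_b)
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total / len(dataset)
```

Under this sketch, human labels that conform more closely to the assumed model yield a lower negative log-likelihood for the true reward function, which is the sense in which better conformance gives an RLHF algorithm cleaner preference data.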
