Measuring Human Preferences in RLHF is a Social Science Problem
Bijean Ghafouri, Eun Cheol Choi, Priyanka Dey, Emilio Ferrara

TL;DR
This paper argues that measuring human preferences in RLHF is fundamentally a social science problem, and that diagnostic tools are needed to distinguish genuine preferences from measurement artifacts.
Contribution
The paper introduces a taxonomy and diagnostic framework for distinguishing genuine preferences from non-attitudes and measurement artifacts in RLHF annotation.
Findings
Behavioral science shows that people routinely give responses without holding a genuine opinion.
People often construct preferences on the spot from contextual cues, and interpret identical questions differently.
Current RLHF may be modeling noise as human values.
Abstract
RLHF assumes that annotation responses reflect genuine human preferences. We argue this assumption warrants systematic examination, and that behavioral science offers frameworks that bring clarity to when it holds and when it breaks down. Behavioral scientists have documented for sixty years that people routinely produce responses without holding genuine opinions, construct preferences on the spot based on contextual cues, and interpret identical questions differently. These phenomena are pervasive for precisely the value-laden judgments that matter most for alignment, yet this literature has not yet been systematically integrated into ML practice. We argue that the ML community must treat measurement validity as logically prior to preference aggregation. Specifically, we contend that measuring human preferences in RLHF is a social science problem. We present a taxonomy distinguishing…
