Three Models of RLHF Annotation: Extension, Evidence, and Authority
Steve Coyne

TL;DR
This paper distinguishes three conceptual models of the role human judgment plays in RLHF, analyzes their implications, and proposes tailored annotation pipelines to improve the alignment of language models.
Contribution
It introduces and distinguishes three models of human judgment in RLHF—extension, evidence, and authority—and offers normative criteria for their application.
Findings
Survey of landmark RLHF papers illustrating their implicit use of the three models
Identification of failure modes that arise when the models are conflated
Recommendation to decompose annotation into separable dimensions
Abstract
Preference-based alignment methods, most prominently Reinforcement Learning from Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly…
