Three Models of RLHF Annotation: Extension, Evidence, and Authority

Steve Coyne

arXiv:2604.25895 · cs.CY · April 29, 2026

TL;DR

This paper distinguishes three conceptual models of the role that human judgments play in RLHF, analyzes their implications, and proposes tailored annotation pipelines to improve the alignment of language models.

Contribution

The paper introduces and distinguishes three models of human judgment in RLHF (extension, evidence, and authority) and offers normative criteria for their application.

Findings

01. Survey of landmark RLHF papers illustrating implicit use of models

02. Identification of failure modes from conflating models

03. Recommendation to decompose annotation into separable dimensions

Abstract

Preference-based alignment methods, most prominently Reinforcement Learning from Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly…
