A Descriptive and Normative Theory of Human Beliefs in RLHF
Sylee Dandekar, Shripad Deshmukh, Frank Chiu, W. Bradley Knox, Scott Niekum

TL;DR
This paper explores how human beliefs about agent capabilities influence preferences in RLHF, proposing a new model and normative bounds to improve alignment and performance.
Contribution
It introduces a preference model incorporating human beliefs, provides a normative theory bounding policy error based on belief mismatch, and empirically demonstrates the impact of beliefs on preferences.
Findings
Beliefs about agent capabilities significantly affect human preferences.
Simple interventions can influence human beliefs about agent capabilities.
Reducing belief mismatch improves RLHF performance.
Abstract
Human preferences in RLHF are typically modeled as a function of the human's reward function or corresponding optimal state-action values. In this work, we propose that human beliefs about the capabilities of the agent being trained also play a key role in preference generation. We examine two questions related to this hypothesis, one descriptive and one normative, respectively: Do human labelers' beliefs about agent capabilities affect the preferences that they provide? And what is the ideal set of beliefs about an agent -- and resulting preferences -- for humans to have? We propose a new preference model that incorporates human beliefs and provide a normative theory that bounds the error on the final learned policy based on the mismatch between the human's beliefs and an idealized set of beliefs. We then confirm via a human study that beliefs about agent capabilities do, in…
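As a rough illustration of the idea in the abstract, the sketch below scores trajectory segments by their summed advantages under a believed agent policy rather than under the optimal policy. The toy MDP, its rewards, the function names, and the logistic link are all assumptions made for illustration; they are not taken from the paper and may differ from the model it defines.

```python
import math

# Hypothetical toy MDP: TRANSITIONS[state][action] = (next_state, reward).
# "goal" and "crash" are terminal states (they have no entries).
TRANSITIONS = {
    "start": {"shortcut": ("ledge", 0.0), "detour": ("goal", 5.0)},
    "ledge": {"careful": ("goal", 10.0), "sloppy": ("crash", -10.0)},
}

def q_value(state, action, policy, gamma=1.0):
    """Q^pi(s, a): take `action` now, then follow `policy` until termination."""
    next_state, reward = TRANSITIONS[state][action]
    if next_state not in TRANSITIONS:  # terminal state reached
        return reward
    return reward + gamma * q_value(next_state, policy[next_state], policy, gamma)

def advantage(state, action, policy, gamma=1.0):
    """A^pi(s, a) = Q^pi(s, a) - V^pi(s), where V^pi(s) = Q^pi(s, pi(s))."""
    v = q_value(state, policy[state], policy, gamma)
    return q_value(state, action, policy, gamma) - v

def preference_prob(seg_a, seg_b, belief_policy, gamma=1.0):
    """P(seg_a preferred over seg_b) under the assumed logistic model."""
    score_a = sum(advantage(s, a, belief_policy, gamma) for s, a in seg_a)
    score_b = sum(advantage(s, a, belief_policy, gamma) for s, a in seg_b)
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Two hypothetical beliefs about what the agent will do after the labeled step.
capable = {"start": "shortcut", "ledge": "careful"}
clumsy = {"start": "detour", "ledge": "sloppy"}

shortcut = [("start", "shortcut")]
detour = [("start", "detour")]

p_capable = preference_prob(shortcut, detour, capable)
p_clumsy = preference_prob(shortcut, detour, clumsy)
```

With these made-up rewards, a labeler who believes the agent will be careful on the ledge favors the shortcut, while one who expects a sloppy agent favors the detour, even though the reward function is identical in both cases.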
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper introduces a belief-based model in which the preference over trajectories is proportional to the difference between the advantages of the trajectories under the belief policy (whereas prior work assumes an optimal policy). The anecdotal example is intuitive and presents the problem clearly.
- The authors present a theoretical lower bound on policy performance as the annotators deviate from the optimal beliefs. This misalignment leads to sub-optimal performance as shown in the grid
- The paper lacks experiments at scale with respect to policy learning that would demonstrate the effectiveness of modeling beliefs about agent capabilities. For example, in the CARLA experiments, it would be interesting to show the return and qualitative performance of the agent when trained using preferences from different priming groups. This would show that the same policy under different beliefs converges to different performance levels (that is, a scalable version of the same experiment).
- It would be interesting to hear the au
1. The paper investigates an important, under-explored problem. RLHF typically assumes preferences reflect an (implicit) optimality/regret model. Modeling how beliefs about agent capability influence preferences addresses a realistic and consequential gap, especially for safety-critical settings where an over-confident preference can induce risky policies. This framing is timely and relevant to the community.
2. The paper provides a theoretical guarantee tying disagreement to expected return. Theorem 4.3
1. The paper makes strong simplifications in the theory, which limit practical applicability. The theoretical bound (Theorem 4.3) assumes noiseless preferences, deterministic tabular policies, single-transition preference pairs, and that RLHF yields a deterministic policy that exactly respects all preferences. These assumptions are overly idealistic and far from the modern practice of RLHF (noisy annotators, function approximation, and finite data). The practical implication of the bound in real
* The paper is mostly well-written, clearly organized, and easy to follow. It formalizes the agent-labeler agreement clearly, which, to my knowledge, hasn't been discussed much in previous literature.
* A human study is conducted as evidence that preferences can be shifted in a statistically significant way, which confirms the motivation of this paper.
* The recommendations given by the authors in Sec 6.1 don't seem actionable to me (e.g., inform labelers of the known limitation). This still leaves open the problem of *how* or *what* to disclose. As the authors have demonstrated through their human study, these choices can affect preferences in a significant way.
* Section 4 gives us a theory for the effect of human-labeler disagreement, but the authors haven't shown how this theory can be applied for designing any operatio
Taxonomy
Topics: Theology and Philosophy of Evil · Epistemology, Ethics, and Metaphysics
