Mitigating Cognitive Bias in RLHF by Altering Rationality
Tiffany Horter, Andrew Markham, Niki Trigoni, Serena Booth

TL;DR
This paper proposes a method to improve reinforcement learning from human feedback by dynamically adjusting the rationality parameter to account for cognitive biases in human judgments, leading to more reliable models.
Contribution
The paper introduces an approach that adaptively modifies the rationality parameter during reward learning, using an LLM to detect biases in human judgments, with the aim of making the reward model more robust.
Findings
The method produces more rational models despite biased datasets.
Dynamic adjustment of the rationality parameter improves reward model accuracy.
The approach effectively identifies and downweights biased human preferences.
Abstract
How can we make models robust to even imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards are inferred from pairwise comparisons, this learning depends on an assumed relationship between latent reward differences and observed preferences, typically modeled using a Boltzmann formulation in which a rationality parameter beta informs how consistently preferences reflect reward differences. In practice, beta is typically treated as a fixed constant that reflects assumed uniform annotator reliability. However, human feedback is not this simplistic in practice: real human judgments are shaped by cognitive biases, leading to systematic deviations from reward-consistent behavior that arise contextually. To address this, we treat…
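To make the role of beta concrete, the sketch below shows a standard Boltzmann (Bradley-Terry style) preference likelihood in which beta is passed per comparison rather than fixed globally. It is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, and how beta would be chosen for each comparison (for example, from an LLM-based bias detector) is not shown because the abstract excerpt above is truncated.

```python
import math


def preference_prob(r_a: float, r_b: float, beta: float) -> float:
    """Boltzmann probability that response A is preferred over response B.

    beta is the rationality parameter: beta = 0 gives a coin flip, and
    larger beta makes the preference track the reward difference more closely.
    """
    return 1.0 / (1.0 + math.exp(-beta * (r_a - r_b)))


def preference_nll(r_chosen: float, r_rejected: float, beta: float) -> float:
    """Negative log-likelihood of one observed preference label."""
    return -math.log(preference_prob(r_chosen, r_rejected, beta))


def nll_grad_wrt_margin(r_chosen: float, r_rejected: float, beta: float) -> float:
    """Derivative of the NLL with respect to the margin (r_chosen - r_rejected).

    Its magnitude indicates how strongly this single label pulls on the
    reward model during training.
    """
    return -beta * (1.0 - preference_prob(r_chosen, r_rejected, beta))


# Hypothetical example: a label that contradicts the current reward estimates
# (the "chosen" response scores lower than the "rejected" one).
grad_full_trust = nll_grad_wrt_margin(r_chosen=0.0, r_rejected=1.0, beta=1.0)
grad_low_trust = nll_grad_wrt_margin(r_chosen=0.0, r_rejected=1.0, beta=0.1)
print(abs(grad_low_trust) < abs(grad_full_trust))
```

In this formulation, assigning a smaller beta to a comparison suspected of being biased shrinks its gradient, so the label has less influence on the learned reward. That matches the downweighting behavior described in the findings, though the actual mechanism in the paper may differ.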
