Aligning to Illusions: Choice Blindness in Human and AI Feedback
Wenbin Wu

TL;DR
This paper shows that both human annotators and LLM judges often fail to notice when their stated preferences have been swapped (choice blindness). The resulting label corruption can degrade the reward signal in RLHF, yet it is difficult to detect with standard evaluation metrics such as pairwise accuracy.
Contribution
It demonstrates that choice blindness is prevalent in both human and LLM feedback, shows that current detection methods largely miss it, and measures how the resulting label corruption affects RLHF reward learning.
Findings
91% of surreptitiously swapped preferences go undetected by human annotators
Detection in LLM judges depends on shallow text matching
Corruption of 50% of labels halts reward improvement
Abstract
Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A…
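The dose-response idea in the abstract (corrupt a growing share of preference labels and watch what happens downstream) can be illustrated with a minimal sketch. This is not the paper's procedure: the uniform random flipping scheme, the binary label encoding, and the names `corrupt_labels` and `agreement` are all assumptions made for illustration.

```python
import random


def corrupt_labels(labels, fraction, seed=0):
    """Flip a fixed fraction of binary preference labels.

    Assumed encoding (hypothetical): 0 = response A preferred, 1 = response B preferred.
    Assumed noise model (hypothetical): indices to flip are chosen uniformly at random.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(fraction * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def agreement(clean, noisy):
    """Share of labels that still match the uncorrupted annotation."""
    return sum(c == n for c, n in zip(clean, noisy)) / len(clean)


if __name__ == "__main__":
    clean = [1] * 600  # toy dataset: annotators always prefer response B
    for fraction in (0.0, 1 / 6, 1 / 3, 0.5):
        noisy = corrupt_labels(clean, fraction)
        print(f"{fraction:.3f} corrupted -> agreement {agreement(clean, noisy):.3f}")
```

In a real experiment, each corrupted label set would be used to train a separate reward model, and the quantities reported in the abstract (reward signal strength and pairwise accuracy) would be measured on held-out data. The sketch covers only the corruption step.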
Taxonomy
Topics: Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing · AI in Service Interactions
