Aligning to Illusions: Choice Blindness in Human and AI Feedback

Wenbin Wu

arXiv:2603.08412·cs.CL·March 10, 2026

Open Access

TL;DR

Human annotators and LLM judges both often fail to notice when their stated preferences are swapped. The resulting label corruption can weaken the RLHF reward signal while standard pairwise accuracy stays virtually unchanged, so current evaluation methods are unlikely to catch it.

Contribution

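The paper demonstrates choice blindness in both human and LLM feedback, shows that LLM judges detect swaps through shallow text matching rather than genuine self-monitoring, and measures how label corruption degrades the RLHF reward signal without moving standard pairwise accuracy.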

Findings

01. 91% of preference swaps go undetected in humans

02. Detection in LLM judges depends on shallow text matching

03. Corruption of 50% of labels halts reward improvement

Abstract

Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A…
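
The label-corruption step behind a dose-response experiment like the one in the abstract can be illustrated with a minimal sketch. This is not the paper's code: the function name `corrupt_preferences`, the `(chosen, rejected)` tuple layout, the seed, and the sweep values are all assumptions made for illustration. It only shows how a fixed fraction of preference labels might be swapped before training a reward model.

```python
import random


def corrupt_preferences(pairs, rate, seed=0):
    """Return a copy of `pairs` with a fraction `rate` of labels swapped.

    Hypothetical helper: each pair is a (chosen, rejected) tuple, and
    "corrupting" a pair means reversing it to (rejected, chosen).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)            # fixed seed for reproducibility
    n_flip = round(rate * len(pairs))    # number of pairs to swap
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    return [
        (rejected, chosen) if i in flip_idx else (chosen, rejected)
        for i, (chosen, rejected) in enumerate(pairs)
    ]


# Toy data: 12 pairs whose clean label always prefers "good_i".
clean = [(f"good_{i}", f"bad_{i}") for i in range(12)]

# Sweep a few illustrative corruption rates (values chosen for the demo).
for rate in (0.0, 1 / 6, 1 / 3, 1 / 2):
    noisy = corrupt_preferences(clean, rate)
    swapped = sum(1 for a, b in zip(clean, noisy) if a != b)
    print(f"rate={rate:.2f} swapped={swapped}/{len(clean)}")
```

In a full experiment, each corrupted dataset would be used to train a separate reward model, and the downstream reward signal would be compared across corruption rates. That part depends on the training setup and is omitted here.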

Taxonomy

Topics: Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing · AI in Service Interactions