Confirmation bias: A challenge for scalable oversight

Gabriel Recchia; Chatrik Singh Mangat; Jinu Nyachhyon; Mridul Sharma; Callum Canavan; Dylan Epstein-Gross; Muhammed Abdulbari

arXiv:2507.19486 · cs.HC · July 29, 2025

TL;DR

This paper investigates how human biases, confirmation bias in particular, affect simple oversight protocols used to evaluate AI models. It finds no overall advantage for the tested protocols and argues that more robust oversight methods are needed as AI capabilities grow.

Contribution

It provides empirical evidence that simple oversight protocols are vulnerable to human biases and highlights the importance of testing their robustness against evaluator errors.

Findings

01. Simple protocols show no overall advantage in oversight accuracy.

02. Evaluator confidence increases after online research, even with incorrect answers.

03. Knowledge gaps in evaluators diminish protocol effectiveness as models scale.

Abstract

Scalable oversight protocols aim to empower evaluators to accurately verify AI models more capable than themselves. However, human evaluators are subject to biases that can lead to systematic errors. We conduct two studies examining the performance of simple oversight protocols where evaluators know that the model is "correct most of the time, but not all of the time". We find no overall advantage for the tested protocols, although in Study 1, showing arguments in favor of both answers improves accuracy in cases where the model is incorrect. In Study 2, participants in both groups become more confident in the system's answers after conducting online research, even when those answers are incorrect. We also reanalyze data from prior work that was more optimistic about simple protocols, finding that human evaluators possessing knowledge absent from models likely contributed to their…
