Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models
Ayush Rajesh Jhaveri, Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi

TL;DR
This paper tests large language models for confirmation bias using a rule-discovery task adapted from human psychology. It shows that the bias makes rule discovery slower and less frequent, and that interventions originally developed for humans can mitigate it.
Contribution
The paper shows that LLMs exhibit confirmation bias during hypothesis testing and introduces intervention strategies that improve their reasoning and rule-discovery performance.
Findings
LLMs exhibit confirmation bias: they tend to propose triples that confirm their current hypothesis rather than triples that could falsify it.
Prompting LLMs to consider counterexamples reduces confirmation bias.
Intervention distillation improves LLMs' generalization to new tasks.
Abstract
Confirmation bias, the tendency to seek evidence that supports rather than challenges one's belief, hinders one's reasoning ability. We examine whether large language models (LLMs) exhibit confirmation bias by adapting the rule-discovery study from human psychology: given a sequence of three numbers (a "triple"), an agent engages in an interactive feedback loop where it (1) proposes a new triple, (2) receives feedback on whether it satisfies the hidden rule, and (3) guesses the rule. Across eleven LLMs of multiple families and scales, we find that LLMs exhibit confirmation bias, often proposing triples to confirm their hypothesis rather than trying to falsify it. This leads to slower and less frequent discovery of the hidden rule. We further explore intervention strategies (e.g., encouraging the agent to consider counter examples) developed for humans. We find prompting LLMs with such…
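The feedback loop described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's setup: the hidden rule ("strictly increasing"), the agent's working hypothesis ("each number is 2 more than the last"), and the two scripted proposal strategies are all hypothetical choices borrowed from the classic 2-4-6 version of the task, and no LLM is involved. It shows why confirming tests can never expose a hypothesis that is too narrow, while a single falsifying test can.

```python
from typing import Callable, List, Optional, Tuple

Triple = Tuple[int, int, int]


def hidden_rule(t: Triple) -> bool:
    """Assumed hidden rule for illustration: the numbers strictly increase."""
    a, b, c = t
    return a < b < c


def hypothesis(t: Triple) -> bool:
    """Assumed working hypothesis of the agent: each step adds exactly 2."""
    a, b, c = t
    return b - a == 2 and c - b == 2


def confirming_proposals(n: int) -> List[Triple]:
    """Propose only triples that the hypothesis predicts will pass."""
    return [(k, k + 2, k + 4) for k in range(10, 10 + n)]


def falsifying_proposals(n: int) -> List[Triple]:
    """Propose triples that the hypothesis predicts will fail."""
    candidates = [(1, 2, 3), (5, 10, 20), (6, 4, 2), (3, 3, 3)]
    return candidates[:n]


def turns_to_falsify(
    proposals: List[Triple],
    hyp: Callable[[Triple], bool],
    rule: Callable[[Triple], bool],
) -> Optional[int]:
    """Return the first turn where feedback contradicts the hypothesis.

    Feedback is rule(triple); the hypothesis is contradicted when its
    prediction differs from that feedback. Returns None if no proposal
    ever produces contradicting feedback.
    """
    for turn, triple in enumerate(proposals, start=1):
        if hyp(triple) != rule(triple):
            return turn
    return None


confirming_result = turns_to_falsify(confirming_proposals(4), hypothesis, hidden_rule)
falsifying_result = turns_to_falsify(falsifying_proposals(4), hypothesis, hidden_rule)
print("confirming strategy:", confirming_result)
print("falsifying strategy:", falsifying_result)
```

In this sketch every confirming proposal receives positive feedback, so the agent never learns that its hypothesis is narrower than the hidden rule. The falsifying strategy's first proposal, (1, 2, 3), passes the hidden rule despite violating the hypothesis, which immediately signals that the hypothesis must be revised.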
