Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models
Brian Christian, Matan Mazor

TL;DR
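This paper shows that large language models can mitigate gender and race biases and sycophancy through self-blinding and counterfactual self-simulation: rather than being prompted to ignore biasing information, the model consults its own API to find out what it would decide without that information, which improves fairness and transparency in decision-making.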
Contribution
The paper proposes giving LLMs access to a ground-truth model of their own counterfactual cognition, namely their own API, so that they can check what they would have decided without the biasing information. The paper presents this approach as new.
Findings
Self-blinding reduces gender and race biases in LLM decisions.
Counterfactual self-simulation improves fairness and transparency.
Prompting models to ignore or pretend not to know biasing information fails to offset these biases and occasionally backfires.
Abstract
Fair decisions require ignoring irrelevant, potentially biasing, information. To achieve this, decision-makers need to approximate what decision they would have made had they not known certain facts, such as the gender or race of a job candidate. This counterfactual self-simulation is notoriously hard for humans, leading to biased judgments even by well-meaning actors. Here we show that large language models (LLMs) suffer from similar limitations in their ability to approximate what decisions they would make under counterfactual knowledge in offsetting gender and race biases and overcoming sycophancy. We show that prompting models to ignore or pretend not to know biasing information fails to offset these biases and occasionally backfires. However, unlike humans, LLMs can be given access to a ground-truth model of their own counterfactual cognition -- their own API. We show that this…
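The abstract's central idea, a model consulting its own API for the decision it would make without the biasing facts, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record fields, the prompt wording, the helpers `blind`, `render_prompt`, and `self_blinded_decision`, and the `biased_stub` function standing in for a real model API are all hypothetical names introduced here.

```python
from typing import Callable, Dict, Iterable

# Hypothetical type: any function that sends a prompt to a model and returns its answer.
QueryFn = Callable[[str], str]


def blind(record: Dict[str, str], hidden: Iterable[str]) -> Dict[str, str]:
    """Return a copy of the record with the potentially biasing fields removed."""
    hidden_set = set(hidden)
    return {key: value for key, value in record.items() if key not in hidden_set}


def render_prompt(record: Dict[str, str]) -> str:
    """Turn a record into a prompt (assumed wording, fields in sorted order)."""
    lines = [f"{key}: {value}" for key, value in sorted(record.items())]
    return "Should this candidate be interviewed? Answer yes or no.\n" + "\n".join(lines)


def self_blinded_decision(
    record: Dict[str, str], hidden: Iterable[str], query_model: QueryFn
) -> str:
    """Query the model on the blinded record, so the hidden fields cannot affect the answer."""
    return query_model(render_prompt(blind(record, hidden)))


def biased_stub(prompt: str) -> str:
    """Toy stand-in for a model API (an assumption): deliberately biased on gender."""
    return "no" if "gender: female" in prompt else "yes"


if __name__ == "__main__":
    candidate = {"degree": "MSc", "experience_years": "8", "gender": "female"}
    # Direct query: the stub sees the gender field and its bias determines the answer.
    print(biased_stub(render_prompt(candidate)))  # prints "no"
    # Self-blinded query: the gender field never reaches the stub.
    print(self_blinded_decision(candidate, ["gender"], biased_stub))  # prints "yes"
```

The point of the design is that the blinded call is a fresh query that never contains the hidden fields, so the answer is the model's actual counterfactual decision rather than its attempt to simulate one. In a real setting `query_model` would wrap a call to the same model's API.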
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Language and cultural evolution
