Biases in the Blind Spot: Detecting What LLMs Fail to Mention
Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu

TL;DR
This paper presents an automated, black-box method for detecting hidden, task-specific biases in large language models: it tests whether candidate concepts change the model's decisions while going unmentioned in its reasoning traces, without relying on predefined bias categories.
Contribution
The authors introduce a pipeline that automatically uncovers unverbalized biases in LLMs, removing the need for the predefined categories and hand-crafted datasets that existing bias evaluations typically require.
Findings
Automatically discovers previously unknown biases such as language fluency and formality.
Validates known biases like gender, race, and religion.
Works across multiple models and decision tasks.
Abstract
Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's…
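The flagging rule described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the exact sign test on paired positive/negative variations, the Bonferroni correction standing in for "statistical techniques for multiple testing", the `max_verbalized` cutoff, and the toy counts. Early stopping is not modeled.

```python
from math import comb

def sign_test_p(pos_wins: int, neg_wins: int) -> float:
    """Two-sided exact sign test on discordant pairs (null: p = 0.5)."""
    n = pos_wins + neg_wins
    if n == 0:
        return 1.0
    k = min(pos_wins, neg_wins)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def flag_unverbalized(results, alpha=0.05, max_verbalized=0.1):
    """Flag concepts that shift decisions but are rarely cited in reasoning.

    results maps concept -> (pos_wins, neg_wins, verbalized_rate), where
    pos_wins / neg_wins count input pairs in which only the positive /
    only the negative variation received the favorable decision, and
    verbalized_rate is the fraction of reasoning traces citing the concept.
    All names and thresholds are hypothetical.
    """
    threshold = alpha / len(results)  # Bonferroni correction (assumed)
    flagged = []
    for concept, (pos, neg, verbalized) in results.items():
        significant = sign_test_p(pos, neg) < threshold
        if significant and verbalized <= max_verbalized:
            flagged.append(concept)
    return flagged

# Toy counts invented for illustration; not results from the paper.
toy_results = {
    "concept_a": (30, 5, 0.02),   # significant, rarely verbalized
    "concept_b": (28, 6, 0.50),   # significant, but often verbalized
    "concept_c": (12, 10, 0.00),  # not significant
}
flagged = flag_unverbalized(toy_results)
print(flagged)
```

With these toy inputs only `concept_a` meets both conditions: a significant decision shift and a low rate of being cited as justification.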
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Artificial Intelligence in Healthcare and Education
