ABCD: All Biases Come Disguised

Mateusz Nowak; Xavier Cadet; Peter Chin

arXiv:2602.17445·cs.CL·February 20, 2026

Open Access

TL;DR

This paper identifies biases in multiple-choice question benchmarks for LLMs, proposes a bias-reduction evaluation protocol using uniform, unordered labels, and demonstrates improved robustness and lower variance across multiple models and benchmarks.

Contribution

It introduces a simple bias-reduced evaluation protocol that reduces label, position, and few-shot prompt biases in LLM assessments, improving robustness to answer permutations with only a minimal drop in performance.

Findings

01 Reduced accuracy variance by 3× across benchmarks

02 Improved robustness to answer permutations

03 Minimal performance decrease with bias mitigation
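
As a rough illustration of what "variance across answer permutations" can mean, the sketch below scores a model once per ordering of the answer options and reports the mean and standard deviation of the per-ordering accuracies. The function `accuracy_spread`, the `answer_fn` callback, and the `(question, options, gold)` item layout are hypothetical names chosen for this example, and the sketch assumes every item has the same number of options; it is not the paper's evaluation code.

```python
import itertools
import statistics

def accuracy_spread(answer_fn, items):
    """Mean and population std of accuracy over all option orderings.

    answer_fn(question, options) -> text of the chosen option (hypothetical callback).
    items: list of (question, options, gold) with equal-length option lists (assumption).
    """
    n_options = len(items[0][1])
    accuracies = []
    for perm in itertools.permutations(range(n_options)):
        correct = 0
        for question, options, gold in items:
            reordered = [options[i] for i in perm]
            if answer_fn(question, reordered) == gold:
                correct += 1
        accuracies.append(correct / len(items))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)

# Toy data: two questions with three options each.
ITEMS = [
    ("q1", ["a1", "b1", "c1"], "a1"),
    ("q2", ["a2", "b2", "c2"], "b2"),
]
GOLD = {question: gold for question, _, gold in ITEMS}

def first_option_model(question, options):
    # A fully position-biased "model": always picks whatever is listed first.
    return options[0]

def order_invariant_model(question, options):
    # An idealized model whose choice does not depend on option order.
    return GOLD[question]
```

A position-biased answerer such as `first_option_model` yields a nonzero spread, while an order-invariant one yields a spread of zero; a lower spread is the kind of robustness the findings above refer to.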

Abstract

Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model uses the answer position, the label in front of the answer, the distribution of correct answers present in the few-shot prompt, or a combination of these to answer each MCQ. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in the LLM's performance, exposing the LLM's…
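
The protocol described in the abstract can be sketched roughly as follows: every option gets the same label, the options are shuffled, the model is asked to reply with the full answer text, and the reply is mapped back to an option by text similarity. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the `-` label, and the prompt wording are hypothetical, and a token-overlap (Jaccard) score stands in for the sentence similarity model mentioned in the abstract.

```python
import random
import re

def build_prompt(question, options, label="-", seed=None):
    """Render an MCQ with one uniform label on every option, in shuffled order.

    The "-" label and the instruction wording are assumptions for illustration.
    """
    rng = random.Random(seed)
    shuffled = list(options)
    rng.shuffle(shuffled)
    lines = [f"Question: {question}"]
    lines += [f"{label} {option}" for option in shuffled]
    lines.append("Answer with the full text of the correct option.")
    return "\n".join(lines), shuffled

def _tokens(text):
    # Lowercased alphanumeric tokens as a set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-overlap similarity; a simple stand-in for a sentence similarity model."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def match_answer(response, options, similarity=jaccard):
    """Return the option whose text is most similar to the free-text response."""
    return max(options, key=lambda option: similarity(response, option))
```

Because the labels carry no ordering information, the model cannot answer by pointing at a letter or position; `similarity` is a parameter so that an embedding-based scorer could be swapped in for the token-overlap stand-in.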

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Text and Document Classification Technologies