ABCD: All Biases Come Disguised
Mateusz Nowak, Xavier Cadet, Peter Chin

TL;DR
This paper identifies biases in multiple-choice question benchmarks for LLMs, proposes a bias-reduction evaluation protocol using uniform, unordered labels, and demonstrates improved robustness and lower variance across multiple models and benchmarks.
Contribution
It introduces a simple bias-reduced evaluation protocol that lessens label-position and few-shot-prompt biases in LLM evaluation, improving robustness to answer ordering with only a minimal drop in performance.
Findings
Reduced accuracy variance by 3× across benchmarks
Improved robustness to answer permutations
Minimal performance decrease with bias mitigation
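The robustness findings above concern how much accuracy moves when the same answers are presented in a different order. A minimal sketch of that measurement follows; it is an illustration, not the authors' evaluation code. The function name `permutation_accuracy_stats`, the `answer_fn` callback, and the assumption that every item has the same number of options are all hypothetical.

```python
import itertools
import statistics

def permutation_accuracy_stats(items, answer_fn, max_perms=24):
    """Return (mean, population std) of accuracy across answer orderings.

    items: list of (question, options, gold_text) tuples; every item is
           assumed to have the same number of options (an assumption of
           this sketch).
    answer_fn: hypothetical callback (question, options) -> chosen option text.
    """
    n_options = len(items[0][1])
    perms = itertools.islice(itertools.permutations(range(n_options)), max_perms)
    accuracies = []
    for perm in perms:
        correct = 0
        for question, options, gold in items:
            reordered = [options[i] for i in perm]
            if answer_fn(question, reordered) == gold:
                correct += 1
        accuracies.append(correct / len(items))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)
```

An answerer that always picks the first option yields a large standard deviation under this measure, while one that selects by content yields zero, which is the kind of gap the findings describe.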
Abstract
Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model uses the answer position, the label in front of the answer, the distribution of correct answers present in the few-shot prompt, or a combination of these to answer each MCQ. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in the LLM's performance, exposing the LLM's…
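The protocol described in the abstract can be sketched roughly as follows. This is a hedged illustration only: the abstract does not give the exact label string or prompt wording, so the `-` label and the instruction sentence are assumptions, and `difflib.SequenceMatcher` from the Python standard library is used here as a stand-in for the sentence similarity model the authors mention. The names `build_prompt` and `match_answer` are hypothetical.

```python
import random
from difflib import SequenceMatcher

def build_prompt(question, options, label="-", seed=None):
    """Format an MCQ so every option carries the same label.

    The label "-" and the instruction wording are assumptions of this
    sketch, not the paper's exact prompt.
    """
    rng = random.Random(seed)
    shuffled = list(options)
    rng.shuffle(shuffled)  # identical labels carry no ordering information
    lines = [f"Question: {question}"]
    lines += [f"{label} {option}" for option in shuffled]
    lines.append("Answer with the full text of the correct option.")
    return "\n".join(lines), shuffled

def match_answer(response, options):
    """Map a free-text response to the most similar option text.

    SequenceMatcher is a character-level stand-in for a sentence
    similarity model; swap in an embedding model for real use.
    """
    cleaned = response.lower().strip()

    def score(option):
        return SequenceMatcher(None, cleaned, option.lower()).ratio()

    return max(options, key=score)
```

Because the model must reproduce the answer text rather than a letter, the scoring step depends on the content of the response and not on a label or a position.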
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Text and Document Classification Technologies
