Social Bias in Popular Question-Answering Benchmarks

Angelie Kraft; Judith Simon; Sonja Schimmler

arXiv:2505.15553·cs.CL·January 8, 2026

TL;DR

This paper critically examines popular question-answering (QA) and reading comprehension (RC) benchmarks and finds that they exhibit social biases and do not represent different demographics or regions evenly, which may contribute to biases in large language models (LLMs).

Contribution

It analyzes both the content of the benchmarks and the practices used to create them, drawing on 30 benchmark papers and 20 benchmark datasets, and highlights biases as well as a lack of transparency about who created and annotated the benchmarks.

Findings

01

Most benchmarks lack demographic diversity in creation and content.

02

Only one benchmark explicitly addresses social bias mitigation.

03

Gender, religion, and geographic biases are prevalent across multiple datasets.

Abstract

Question-answering (QA) and reading comprehension (RC) benchmarks are commonly used for assessing the capabilities of large language models (LLMs) to retrieve and reproduce knowledge. However, we demonstrate that popular QA and RC benchmarks do not cover questions about different demographics or regions in a representative way. We perform a content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to learn (1) who is involved in the benchmark creation, (2) whether the benchmarks exhibit social bias, or whether this is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most benchmark papers analyzed provide insufficient information about those involved in benchmark creation, particularly the annotators. Notably, just one (WinoGrande) explicitly reports…
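As a rough illustration of what a quantitative check for representation in benchmark content could look like, the sketch below counts gendered terms across a list of questions. It is not the authors' method: the term lists, the function name `count_gender_mentions`, and the sample questions are all hypothetical and were written only for this example.

```python
import re
from collections import Counter

# Hypothetical term lists, chosen for illustration only; not taken from the paper.
GENDER_TERMS = {
    "female": {"she", "her", "hers", "woman", "women"},
    "male": {"he", "him", "his", "man", "men"},
}


def count_gender_mentions(questions):
    """Count how many tokens in the questions fall into each term list."""
    counts = Counter({group: 0 for group in GENDER_TERMS})
    for question in questions:
        # Lowercase and split into simple word tokens.
        tokens = re.findall(r"[a-z']+", question.lower())
        for group, terms in GENDER_TERMS.items():
            counts[group] += sum(token in terms for token in tokens)
    return dict(counts)


# Toy questions written for this example, not drawn from any benchmark.
sample_questions = [
    "Who was the first man to walk on the moon?",
    "Which prize did she win in 1911?",
    "What did he discover, and where did his team work?",
]
mention_counts = count_gender_mentions(sample_questions)
print(mention_counts)
```

A simple keyword count like this only hints at imbalance; it ignores context, names, and non-binary references, so a real analysis would need more careful methods.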

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Multimodal Machine Learning Applications