Social Bias in Popular Question-Answering Benchmarks
Angelie Kraft, Judith Simon, Sonja Schimmler

TL;DR
This paper critically examines popular question-answering (QA) and reading comprehension (RC) benchmarks and finds significant social biases and unrepresentative demographic coverage, which may contribute to biases in large language models.
Contribution
The paper provides a comprehensive analysis of benchmark content and creation practices, highlighting social biases and a lack of transparency about the demographics of those who create and annotate the benchmarks.
Findings
Most benchmarks lack demographic diversity in creation and content.
Only one benchmark explicitly addresses social bias mitigation.
Gender, religion, and geographic biases are prevalent across multiple datasets.
Abstract
Question-answering (QA) and reading comprehension (RC) benchmarks are commonly used for assessing the capabilities of large language models (LLMs) to retrieve and reproduce knowledge. However, we demonstrate that popular QA and RC benchmarks do not cover questions about different demographics or regions in a representative way. We perform a content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to learn (1) who is involved in the benchmark creation, (2) whether the benchmarks exhibit social bias, or whether this is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most benchmark papers analyzed provide insufficient information about those involved in benchmark creation, particularly the annotators. Notably, just one (WinoGrande) explicitly reports…
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Multimodal Machine Learning Applications
