GG-BBQ: German Gender Bias Benchmark for Question Answering

Shalaka Satheesh; Katrin Klug; Katharina Beckh; Héctor Allende-Cid; Sebastian Houben; Teena Hassan

arXiv:2507.16410·cs.CL·July 23, 2025

TL;DR

This paper introduces GG-BBQ, a German gender bias benchmark for question answering, highlighting the importance of manual translation correction and revealing bias in several large language models.

Contribution

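The paper presents a new German gender bias dataset for question answering, built by machine translating the gender identity templates of the English Bias Benchmark for Question Answering (Parrish et al., 2022) and manually correcting the translations with a language expert. It then uses the dataset to evaluate gender bias in German LLMs.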

Findings

1. Models exhibit gender bias and stereotypes
2. Manual translation correction is essential for dataset quality
3. Bias varies across different language models

Abstract

Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model's predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender…
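The template-based construction described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `{{NAME1}}`/`{{NAME2}}` slot names, the `FILLERS` table, the helper functions, and the example sentence are hypothetical and are not taken from the GG-BBQ dataset. The sketch shows one general property of German that a word-for-word machine translation of an English template can miss: the indefinite article must agree with the grammatical gender of the noun that fills the slot, so the article has to travel with the filler rather than stay fixed in the template.

```python
# Hypothetical slot fillers (not from GG-BBQ): each noun is stored with its
# nominative indefinite article, since German articles agree with the noun's
# grammatical gender ("ein Mann" but "eine Frau").
FILLERS = {
    "Mann": "ein",
    "Frau": "eine",
}


def fill_template(template: str, name1: str, name2: str) -> str:
    """Replace the illustrative {{NAME1}}/{{NAME2}} slots with article + noun."""
    def phrase(noun: str) -> str:
        return f"{FILLERS[noun]} {noun}"

    return (
        template.replace("{{NAME1}}", phrase(name1))
                .replace("{{NAME2}}", phrase(name2))
    )


def both_orders(template: str, a: str, b: str) -> list[str]:
    """Instantiate the template with both orderings of the two fillers."""
    return [fill_template(template, a, b), fill_template(template, b, a)]


# Hypothetical German context template in the nominative case.
template = "Im Park saßen {{NAME1}} und {{NAME2}}."

if __name__ == "__main__":
    for sentence in both_orders(template, "Mann", "Frau"):
        print(sentence)
```

Generating both orderings of the fillers is a common way to keep word order from acting as a confound in template-based bias benchmarks; whether GG-BBQ does exactly this is not stated in the text above.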

Code & Models

Datasets

shalakasatheesh/GG-BBQ (dataset, 24 downloads)

Videos

GG-BBQ: German Gender Bias Benchmark for Question Answering

Taxonomy

Topics: European and International Law Studies