TL;DR
This paper introduces WinoMTDE, a gender bias test set and automatic evaluation method for German machine translation. An evaluation of five MT systems and one large language model finds persistent bias in most models, with the LLM outperforming the traditional systems.
Contribution
The paper presents WinoMTDE, a new gender bias evaluation test set for German MT. It extends the automatic evaluation method of arXiv:1906.00591v1 to German and applies it in a large-scale bias assessment of five widely used MT systems and one large language model.
Findings
Most of the evaluated MT systems exhibit gender bias.
The large language model outperforms the five traditional MT systems.
The bias is persistent across systems rather than isolated to a single model.
Abstract
We present WinoMTDE, a new gender bias evaluation test set designed to assess occupational stereotyping and underrepresentation in German machine translation (MT) systems. Building on the automatic evaluation method introduced by arXiv:1906.00591v1, we extend the approach to German, a language with grammatical gender. The WinoMTDE dataset comprises 288 German sentences that are balanced with regard to both gender and stereotype, the latter annotated using German labor statistics. We conduct a large-scale evaluation of five widely used MT systems and a large language model. Our results reveal persistent bias in most models, with the LLM outperforming traditional systems. The dataset and evaluation code are publicly available at https://github.com/michellekappl/mt_gender_german.
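As a rough illustration of how a gender- and stereotype-balanced test set can be scored, the sketch below computes the three summary numbers commonly reported for WinoMT-style evaluations: overall accuracy, the male/female F1 gap, and the pro-/anti-stereotypical accuracy gap. It assumes the gender expressed in each translation has already been extracted. The `Example` record, the `"pro"`/`"anti"` labels, and all function names are hypothetical and are not taken from the WinoMTDE repository. Whether WinoMTDE reports exactly these metrics is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Example:
    gold: str        # annotated gender of the entity: "male" or "female"
    predicted: str   # gender detected in the translation (assumed already extracted)
    stereotype: str  # "pro" or "anti" (hypothetical labels for the stereotype annotation)


def accuracy(examples):
    """Share of examples whose translated gender matches the gold gender."""
    if not examples:
        return 0.0
    return sum(e.gold == e.predicted for e in examples) / len(examples)


def f1(examples, gender):
    """F1 score for one gender class, treating it as the positive label."""
    tp = sum(e.gold == gender and e.predicted == gender for e in examples)
    fp = sum(e.gold != gender and e.predicted == gender for e in examples)
    fn = sum(e.gold == gender and e.predicted != gender for e in examples)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bias_scores(examples):
    """Accuracy, male-minus-female F1 gap, and pro-minus-anti accuracy gap."""
    pro = [e for e in examples if e.stereotype == "pro"]
    anti = [e for e in examples if e.stereotype == "anti"]
    return {
        "accuracy": accuracy(examples),
        "delta_g": f1(examples, "male") - f1(examples, "female"),
        "delta_s": accuracy(pro) - accuracy(anti),
    }


# Toy data for illustration only; these are not results from the paper.
toy = [
    Example("male", "male", "pro"),
    Example("male", "male", "anti"),
    Example("female", "female", "pro"),
    Example("female", "male", "anti"),
]
scores = bias_scores(toy)
print(scores)
```

Under these definitions, a system with no measurable bias would score close to zero on both gaps. A positive `delta_g` means male entities are translated more reliably than female ones, and a positive `delta_s` means the system does better when the gold gender matches the occupational stereotype.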
