EuroGEST: Investigating gender stereotypes in multilingual language models
Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch

TL;DR
EuroGEST is a multilingual benchmark dataset that measures gender stereotypes in large language models across 30 European languages, revealing persistent stereotypes and the influence of model size and fine-tuning.
Contribution
This work introduces EuroGEST, a multilingual dataset for assessing gender bias in LLMs. It extends an existing expert-informed benchmark covering 16 gender stereotypes using translation tools, quality estimation metrics, and morphological heuristics, and it provides insight into how stereotypes are encoded across languages.
Findings
Larger models encode stereotypes more strongly.
Instruction fine-tuning does not reliably reduce stereotypes.
Women are stereotypically associated with 'beautiful', 'empathetic' and 'neat'; men with 'leaders', 'strong, tough' and 'professional'.
Abstract
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are 'beautiful', 'empathetic' and 'neat' and men are 'leaders', 'strong, tough' and 'professional'. We also show that larger…
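The abstract names quality estimation metrics and morphological heuristics as filters in the data generation step but does not spell out how they are applied. The following is only a minimal sketch of one way such a filter could look, not the authors' pipeline. The `Candidate` record, the `qe_score` field, the 0.8 threshold, and the rule that the masculine and feminine variants must differ on the surface are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One translated sentence pair (hypothetical record layout)."""
    source: str       # English source sentence
    masculine: str    # translation assuming a masculine speaker
    feminine: str     # translation assuming a feminine speaker
    qe_score: float   # assumed quality-estimation score in [0, 1]


def is_gender_marked(c: Candidate) -> bool:
    # Assumed heuristic: the sentence only tests grammatical gender
    # if the two variants differ on the surface.
    return c.masculine.strip().lower() != c.feminine.strip().lower()


def filter_candidates(cands: list[Candidate], min_qe: float = 0.8) -> list[Candidate]:
    # Keep pairs that pass the (assumed) quality threshold and are gender-marked.
    return [c for c in cands if c.qe_score >= min_qe and is_gender_marked(c)]


candidates = [
    # Czech: the adjective ending marks the speaker's gender.
    Candidate("I am tired.", "Jsem unavený.", "Jsem unavená.", 0.93),
    # Czech: present-tense verb, no gender marking, so the variants coincide.
    Candidate("I am crying.", "Pláču.", "Pláču.", 0.95),
    # Slovak: gender-marked, but below the assumed quality threshold.
    Candidate("I was at home.", "Bol som doma.", "Bola som doma.", 0.40),
]

kept = filter_candidates(candidates)
```

In this toy example only the first pair survives: the second is dropped because it carries no gender marking, and the third because its assumed quality score is too low.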
Taxonomy
Topics: Gender Studies in Language · Computational and Text Analysis Methods · Topic Modeling
