GenderBench: Evaluation Suite for Gender Biases in LLMs

Matúš Pikuliak

arXiv:2505.12054·cs.CL·May 20, 2025

TL;DR

GenderBench is an open-source evaluation suite that measures gender biases in large language models across multiple dimensions, revealing consistent stereotypical and discriminatory behaviors in current models.

Contribution

The paper introduces a comprehensive, extensible open-source library for assessing gender biases in LLMs, comprising 14 probes that quantify 19 gender-related harmful behaviors, and publishes an evaluation of 12 models that reveals consistent patterns of bias.

Findings

01. LLMs struggle with stereotypical reasoning.

02. Models struggle with equitable gender representation in generated texts.

03. Discriminatory behavior occasionally occurs in high-stakes scenarios, such as hiring.

Abstract

We present GenderBench -- a comprehensive evaluation suite designed to measure gender biases in LLMs. GenderBench includes 14 probes that quantify 19 gender-related harmful behaviors exhibited by LLMs. We release GenderBench as an open-source and extensible library to improve the reproducibility and robustness of benchmarking across the field. We also publish our evaluation of 12 LLMs. Our measurements reveal consistent patterns in their behavior. We show that LLMs struggle with stereotypical reasoning, equitable gender representation in generated texts, and occasionally also with discriminatory behavior in high-stakes scenarios, such as hiring.

Code & Models

Repositories

matus-pikuliak/genderbench (official)

Taxonomy

Topics: Authorship Attribution and Profiling · Topic Modeling · Ethics and Social Impacts of AI

Methods: Lib