Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets

Mahdi Zakizadeh; Mohammad Taher Pilehvar

arXiv:2501.01168·cs.CL·September 25, 2025

Open Access · 1 Dataset · 1 Video

TL;DR

This paper argues that measuring gender stereotypes in language models is harder than current benchmarks assume: each benchmark captures only part of the picture, and balancing benchmark data across components of gender stereotypes improves the correlation between stereotype benchmarks.

Contribution

It reveals the limitations of existing intrinsic stereotype benchmarks and applies a framework from social psychology to balance their data across components of gender stereotypes.

Findings

01. Balancing data improves correlation between stereotype benchmarks.

02. Current benchmarks only capture partial facets of gender bias.

03. Simple balancing techniques can significantly enhance bias measurement.

Abstract

Accurately measuring gender stereotypical bias in language models is a complex task with many hidden aspects. Current benchmarks have underestimated this multifaceted challenge and failed to capture the full extent of the problem. This paper examines the inconsistencies between intrinsic stereotype benchmarks. We propose that currently available benchmarks each capture only partial facets of gender stereotypes, and when considered in isolation, they provide just a fragmented view of the broader landscape of bias in language models. Using StereoSet and CrowS-Pairs as case studies, we investigated how data distribution affects benchmark results. By applying a framework from social psychology to balance the data of these benchmarks across various components of gender stereotypes, we demonstrated that even simple balancing techniques can significantly improve the correlation between…
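The balancing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each benchmark example carries a hypothetical `component` label (the field name and the label values are placeholders, not taken from the paper), balances by simple downsampling to the rarest component, and uses Pearson correlation as one possible way to compare per-model scores from two benchmarks.

```python
import random
from collections import defaultdict


def balance_by_component(examples, component_key="component", seed=0):
    """Downsample so every component has as many examples as the rarest one.

    `component_key` is a hypothetical field name used only for illustration.
    Raises ValueError if `examples` is empty.
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[component_key]].append(ex)
    # Size of the smallest group sets the per-component quota.
    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    balanced = []
    for name in sorted(groups):
        balanced.extend(rng.sample(groups[name], target))
    return balanced


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5
```

Under these assumptions, one would balance each benchmark with `balance_by_component`, score the same set of models on both balanced benchmarks, and pass the two per-model score lists to `pearson` to see whether agreement between the benchmarks changes.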

Code & Models

Datasets

teias-ai/BMNE
dataset · 18 dl

Videos

Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets· underline

Taxonomy

Topics: Gender Politics and Representation