Choose Your Lenses: Flaws in Gender Bias Evaluation
Hadas Orgad, Yonatan Belinkov

TL;DR
This paper critically examines current gender bias evaluation methods in NLP, highlighting flaws such as over-reliance on intrinsic metrics and dataset-metric coupling, and proposes guidelines for more reliable assessments.
Contribution
It identifies key flaws in current gender bias evaluation practices and offers guidelines to improve the reliability and validity of bias measurement in NLP systems.
Findings
Extrinsic bias metrics are underused compared to intrinsic metrics.
Dataset and metric choices significantly influence bias measurement results.
Coupling of datasets and metrics hampers reliable bias assessment.
Abstract
Considerable efforts to measure and mitigate gender bias in recent years have led to the introduction of an abundance of tasks, datasets, and metrics used in this vein. In this position paper, we assess the current paradigm of gender bias evaluation and identify several flaws in it. First, we highlight the importance of extrinsic bias metrics that measure how a model's performance on some task is affected by gender, as opposed to intrinsic evaluations of model representations, which are less strongly connected to specific harms to people interacting with systems. We find that only a few extrinsic metrics are measured in most studies, although more can be measured. Second, we find that datasets and metrics are often coupled, and discuss how their coupling hinders the ability to obtain reliable conclusions, and how one may decouple them. We then investigate how the choice of the dataset…
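As an illustration of the extrinsic metrics the abstract describes, the sketch below computes one of the simplest: the difference in task accuracy between two gender groups. It is a minimal example under stated assumptions, not the authors' implementation. The function names, the "F"/"M" group labels, and the toy data are all hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(labels, preds, groups):
    """Return task accuracy computed separately for each group value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    return {g: correct[g] / total[g] for g in total}

def performance_gap(labels, preds, groups, a="F", b="M"):
    """Accuracy of group `a` minus accuracy of group `b` (hypothetical labels)."""
    acc = accuracy_by_group(labels, preds, groups)
    return acc[a] - acc[b]

# Toy data (made up for illustration): gold labels, model predictions,
# and the gender annotation attached to each example.
labels = [1, 0, 1, 1, 0, 1, 0, 0]
preds = [1, 0, 0, 1, 0, 1, 0, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

acc = accuracy_by_group(labels, preds, groups)
gap = performance_gap(labels, preds, groups)
print(acc, gap)
```

Because the metric is a plain function of labels, predictions, and group annotations, it can be applied to any dataset that carries gender annotations, which is one simple way to keep the choice of metric separate from the choice of dataset.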
Taxonomy
Topics: Sex and Gender in Healthcare · Gender Politics and Representation
