WinoPron: Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case
Vagrant Gautam, Julius Steuer, Eileen Bingert, Ray Johns, Anne Lauscher, Dietrich Klakow

TL;DR
This paper introduces WinoPron, a corrected and expanded dataset for evaluating gender bias in coreference resolution, and demonstrates its effectiveness by analyzing state-of-the-art models and proposing a nuanced bias evaluation method.
Contribution
The paper identifies issues in the original Winogender Schemas, creates the improved WinoPron dataset, and introduces a new method for more detailed bias evaluation in coreference resolution.
Findings
Accusative pronouns are harder to resolve for all evaluated models.
Bias varies across different pronoun surface forms.
WinoPron provides more reliable bias evaluation than previous datasets.
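To make the first two findings concrete, here is a minimal sketch of how resolution accuracy might be broken down by grammatical case and by pronoun surface form. It is not the paper's evaluation code or its proposed bias method; the record fields (`case`, `pronoun`, `gold`, `predicted`), the helper `accuracy_by_group`, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return accuracy per distinct value of records[i][key].

    Each record is a dict with hypothetical fields: a grouping field
    (e.g. "case" or "pronoun"), the gold antecedent, and the model's
    predicted antecedent.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["predicted"] == record["gold"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy, made-up predictions (NOT results from the paper).
records = [
    {"case": "nominative", "pronoun": "she", "gold": "doctor", "predicted": "doctor"},
    {"case": "nominative", "pronoun": "he", "gold": "patient", "predicted": "patient"},
    {"case": "accusative", "pronoun": "her", "gold": "doctor", "predicted": "patient"},
    {"case": "accusative", "pronoun": "them", "gold": "patient", "predicted": "patient"},
]

by_case = accuracy_by_group(records, "case")
by_pronoun = accuracy_by_group(records, "pronoun")
print(by_case)
print(by_pronoun)
```

Grouping by `case` compares grammatical cases, while grouping by `pronoun` compares individual surface forms; keeping the two breakdowns separate avoids treating different pronominal forms as equivalent.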
Abstract
While measuring bias and robustness in coreference resolution are important goals, such measurements are only as good as the tools we use to measure them. Winogender Schemas (Rudinger et al., 2018) are an influential dataset proposed to evaluate gender bias in coreference resolution, but a closer look reveals issues with the data that compromise its use for reliable evaluation, including treating different pronominal forms as equivalent, violations of template constraints, and typographical errors. We identify these issues and fix them, contributing a new dataset: WinoPron. Using WinoPron, we evaluate two state-of-the-art supervised coreference resolution systems, SpanBERT, and five sizes of FLAN-T5, and demonstrate that accusative pronouns are harder to resolve for all models. We also propose a new method to evaluate pronominal bias in coreference resolution that goes beyond the…
Taxonomy
Topics: Natural Language Processing Techniques
