Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora
Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen

TL;DR
This paper presents a novel actor-based pipeline for detecting and reducing gender discrimination in large-scale news text corpora, improving fairness while maintaining core content dynamics.
Contribution
It introduces a discourse-aware, actor-level method combining sentiment, syntactic agency, and quotation analysis for auditing and balancing gender representation in corpora.
Findings
Structural gender asymmetries can be reduced through systematic filtering.
Subtler biases in sentiment and framing persist after balancing.
The pipeline effectively creates more gender-balanced datasets.
Abstract
Language corpora are the foundation of most natural language processing research, yet they often reproduce structural inequalities. One such inequality is gender discrimination in how actors are represented, which can distort analyses and perpetuate discriminatory outcomes. This paper introduces a user-centric, actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. By combining discourse-aware analysis with metrics for sentiment, syntactic agency, and quotation styles, our method enables both fine-grained auditing and exclusion-based balancing. Applied to the taz2024full corpus of German newspaper articles (1980-2024), the pipeline yields a more gender-balanced dataset while preserving core dynamics of the source material. Our findings show that structural asymmetries can be reduced through systematic filtering, though subtler biases in…
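To make the idea of exclusion-based balancing concrete, here is a minimal sketch. It is not the authors' pipeline: the `Article` record, its per-article actor counts, the `target` and `tolerance` parameters, and the greedy removal rule are all assumptions chosen for illustration. It simply drops the most skewed articles until the share of female actors falls within a tolerance of the target.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record: counts of actors per gender found in one article.
    doc_id: str
    female_actors: int
    male_actors: int


def female_share(articles):
    """Fraction of all counted actors that are female (0.5 if none counted)."""
    female = sum(a.female_actors for a in articles)
    total = female + sum(a.male_actors for a in articles)
    return female / total if total else 0.5


def balance_by_exclusion(articles, target=0.5, tolerance=0.05):
    """Greedily exclude the most skewed article until the share is near target.

    Assumed heuristic for illustration only, not the method from the paper.
    """
    kept = list(articles)
    while kept and abs(female_share(kept) - target) > tolerance:
        if female_share(kept) < target:
            # Female actors are under-represented: look for male-heavy articles.
            def skew(a):
                return a.male_actors - a.female_actors
        else:
            # Female actors are over-represented: look for female-heavy articles.
            def skew(a):
                return a.female_actors - a.male_actors
        worst = max(kept, key=skew)
        if skew(worst) <= 0:
            break  # no remaining article would move the share toward the target
        kept.remove(worst)
    return kept


corpus = [
    Article("a", female_actors=1, male_actors=5),
    Article("b", female_actors=2, male_actors=2),
    Article("c", female_actors=3, male_actors=1),
    Article("d", female_actors=2, male_actors=3),
]
balanced = balance_by_exclusion(corpus)
print([a.doc_id for a in balanced], round(female_share(balanced), 3))
```

A count-based filter like this only addresses the structural imbalance in who appears. It does nothing about sentiment or framing, which is consistent with the finding that subtler biases persist after balancing.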
Taxonomy
Topics: Gender Studies in Language · Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection
