taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades
Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen

TL;DR
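This paper introduces taz2024full, the largest publicly available corpus of German newspaper articles to date (over 1.8 million texts from taz, 1980 to 2024). An analysis of gender representation across four decades of reporting shows a consistent overrepresentation of men and a gradual shift toward more balanced coverage in recent years.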
Contribution
It provides the largest publicly available German newspaper corpus and demonstrates its use in analyzing gender bias and societal trends over time.
Findings
Men are overrepresented in reporting
Recent years show increased gender balance
Corpus supports diverse NLP and social science research
Abstract
Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus's utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range…
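To make the idea of measuring gender representation over time concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the term lists, the function names, and the `(year, text)` input format are all assumptions made for illustration, and a serious analysis would need much richer resources (for example named-entity recognition) than keyword matching.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny term lists (assumption, not from the paper).
MALE_TERMS = {"mann", "männer", "herr", "vater", "sohn"}
FEMALE_TERMS = {"frau", "frauen", "mutter", "tochter"}

TOKEN_RE = re.compile(r"\w+")


def decade_of(year: int) -> int:
    """Map a year to the first year of its decade, e.g. 1985 -> 1980."""
    return year // 10 * 10


def gender_counts_by_decade(articles):
    """Count male and female term matches per decade.

    `articles` is an iterable of (year, text) pairs (assumed format).
    """
    counts = defaultdict(Counter)
    for year, text in articles:
        decade = decade_of(year)
        for token in TOKEN_RE.findall(text.lower()):
            if token in MALE_TERMS:
                counts[decade]["male"] += 1
            elif token in FEMALE_TERMS:
                counts[decade]["female"] += 1
    return counts


def male_share(counter):
    """Share of male matches among all gendered matches, or None if there are none."""
    total = counter["male"] + counter["female"]
    return counter["male"] / total if total else None


if __name__ == "__main__":
    # Invented toy sentences, not taken from the corpus.
    sample = [
        (1985, "Der Mann sprach mit Herr Schmidt. Die Frau hörte zu."),
        (2021, "Die Frau und der Mann diskutierten."),
    ]
    for decade, counter in sorted(gender_counts_by_decade(sample).items()):
        print(decade, dict(counter), male_share(counter))
```

Grouping by decade keeps the comparison coarse enough to be stable on small samples; a per-year breakdown would only need a different key function.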
Taxonomy
Topics: Authorship Attribution and Profiling · Gender Studies in Language · Computational and Text Analysis Methods
