Evaluating Machine Translation Datasets for Low-Web Data Languages: A Gendered Lens

Hellina Hailu Nigatu; Bethelhem Yemane Mamo; Bontu Fufa Balcha; Debora Taye Tesfaye; Elbethel Daniel Zewdie; Ikram Behiru Nesiru; Jitu Ewnetu Hailu; Senait Mengesha Yayo

arXiv:2511.03880 · cs.CL · November 7, 2025

Open Access

TL;DR

This paper critically examines the quality of machine translation datasets for low-resource languages, highlighting gender biases, harmful content, and the disparity between dataset focus and societal impact.

Contribution

The paper analyzes gender representation and harmful content in MT datasets for Afan Oromo, Amharic, and Tigrinya, and emphasizes the importance of dataset quality over quantity.

Findings

01 Training data is dominated by political and religious text, while benchmark datasets focus on news, health, and sports.

02 The datasets skew heavily towards the male gender in names of persons, the grammatical gender of verbs, and stereotypical depictions.

03 The datasets contain harmful depictions of women, especially the larger datasets.

Abstract

As low-resourced languages are increasingly incorporated into NLP research, there is an emphasis on collecting large-scale datasets. But in prioritizing quantity over quality, we risk 1) building language technologies that perform poorly for these languages and 2) producing harmful content that perpetuates societal biases. In this paper, we investigate the quality of Machine Translation (MT) datasets for three low-resourced languages (Afan Oromo, Amharic, and Tigrinya), with a focus on the gender representation in the datasets. Our findings demonstrate that while training data has a large representation of political and religious domain text, benchmark datasets are focused on news, health, and sports. We also found a large skew towards the male gender: in names of persons, the grammatical gender of verbs, and in stereotypical depictions in the datasets. Further, we found harmful and…

Taxonomy

Topics: Natural Language Processing Techniques · Ethics and Social Impacts of AI · Topic Modeling