The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses
Bashar Alhafni, Nizar Habash, Houda Bouamor

TL;DR
This paper introduces an expanded Arabic parallel gender corpus of over 590K words. Its datasets support gender identification and rewriting, enabling work on gender bias mitigation in NLP, especially for morphologically rich languages.
Contribution
The paper presents APGC v2.0, an extended, publicly available corpus for Arabic gender analysis that adds second person targets and more data, supporting research on gender bias mitigation.
Findings
Expanded corpus with 6.5 times more sentences than the previous version
Includes second person gender targets
Supports gender identification and rewriting tasks
Abstract
Gender bias in natural language processing (NLP) applications, particularly machine translation, has been receiving increasing attention. Much of the research on this issue has focused on mitigating gender bias in English NLP models and systems. Addressing the problem in poorly resourced and/or morphologically rich languages has lagged behind, largely due to the lack of datasets and resources. In this paper, we introduce a new corpus for gender identification and rewriting in contexts involving one or two target users (I and/or You) -- first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. The corpus has multiple parallel components: four combinations of 1st and 2nd person in feminine and masculine grammatical genders, as well as English, and English to Arabic machine translation output.…
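The four target combinations mentioned in the abstract can be enumerated with a minimal sketch. The labels "F" and "M", the field names, and the two-letter codes are hypothetical and chosen for illustration only; they are not taken from the corpus's actual file format.

```python
from itertools import product

# Hypothetical labels: "F" = feminine, "M" = masculine.
GENDERS = ("F", "M")


def target_combinations():
    """Return every (first person, second person) gender pairing."""
    return [
        {"first_person": g1, "second_person": g2}
        for g1, g2 in product(GENDERS, repeat=2)
    ]


combos = target_combinations()
# Two-letter codes: first-person gender followed by second-person gender.
labels = [c["first_person"] + c["second_person"] for c in combos]
print(labels)
```

Each pairing would correspond to one parallel Arabic version of a sentence, since the first and second person carry independent grammatical gender preferences.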
