The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses

Bashar Alhafni; Nizar Habash; Houda Bouamor

arXiv:2110.09216·cs.CL·October 19, 2021

TL;DR

This paper introduces an expanded Arabic parallel gender corpus of over 590K words. Its datasets support gender identification and rewriting, providing resources for gender bias mitigation in NLP for morphologically rich languages.

Contribution

The paper presents APGC v2.0, a publicly available extension of the Arabic Parallel Gender Corpus that adds second person gender targets and more data to support research on gender bias mitigation.

Findings

01. Expanded corpus with 6.5 times more sentences

02. Includes second person gender targets

03. Supports gender identification and rewriting tasks

Abstract

Gender bias in natural language processing (NLP) applications, particularly machine translation, has been receiving increasing attention. Much of the research on this issue has focused on mitigating gender bias in English NLP models and systems. Addressing the problem in poorly resourced and/or morphologically rich languages has lagged behind, largely due to the lack of datasets and resources. In this paper, we introduce a new corpus for gender identification and rewriting in contexts involving one or two target users (I and/or You) -- first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. The corpus has multiple parallel components: four combinations of 1st and 2nd person in feminine and masculine grammatical genders, as well as English, and English to Arabic machine translation output.…
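
The "four combinations" mentioned in the abstract follow from crossing two grammatical genders with two target persons. A minimal Python sketch of that enumeration is given below; the labels "F" and "M", the dictionary keys, and the function name are hypothetical choices made for illustration and are not taken from the corpus release.

```python
from itertools import product

# Hypothetical labels for the two grammatical genders (feminine, masculine).
GENDERS = ("F", "M")


def target_combinations():
    """Return every (first person, second person) gender pairing."""
    return [
        {"first_person": g1, "second_person": g2}
        for g1, g2 in product(GENDERS, repeat=2)
    ]


combos = target_combinations()
for combo in combos:
    print(combo)
```

Each pairing corresponds to one Arabic parallel component described in the abstract; the English side and the machine translation output are separate components and are not modelled here.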
