A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine Translation

Giuseppe Attanasio; Flor Miriam Plaza-del-Arco; Debora Nozza; Anne Lauscher

arXiv:2310.12127·cs.CL·October 26, 2023·1 cites

TL;DR

This paper investigates gender bias in instruction-tuned machine translation models, revealing systematic male bias and proposing an interpretability-driven mitigation method that improves fairness in translations.

Contribution

It introduces a novel interpretability-based approach to identify and mitigate gender bias in instruction-tuned machine translation models.

Findings

01. Models default to male-inflected translations, even for occupations stereotypically associated with women.

02. Interpretability methods reveal that models overlook gendered pronouns when translating.

03. A few-shot learning-based mitigation significantly reduces gender bias.

Abstract

Recent instruction fine-tuned models can solve multiple NLP tasks when prompted to do so, with machine translation (MT) being a prominent use case. However, current research often focuses on standard performance benchmarks, leaving compelling fairness and ethical considerations behind. In MT, this might lead to misgendered translations, resulting, among other harms, in the perpetuation of stereotypes and prejudices. In this work, we address this gap by investigating whether and to what extent such models exhibit gender bias in machine translation and how we can mitigate it. Concretely, we compute established gender bias metrics on the WinoMT corpus from English to German and Spanish. We discover that IFT models default to male-inflected translations, even disregarding female occupational stereotypes. Next, using interpretability methods, we unveil that models systematically overlook the…
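The abstract mentions "established gender bias metrics" on WinoMT without naming them. As a rough illustration only, the sketch below assumes the three metrics commonly reported with WinoMT: overall gender accuracy, ΔG (male F1 minus female F1), and ΔS (accuracy on pro-stereotypical minus anti-stereotypical examples). The `Example` record, its field names, and `bias_metrics` are hypothetical and are not taken from the paper or its repository; the gender of each translated entity is assumed to have been extracted already.

```python
from dataclasses import dataclass


@dataclass
class Example:
    gold: str        # reference gender of the entity: "male" or "female"
    pred: str        # gender inferred from the translation (assumed given)
    stereotype: str  # "pro" or "anti" (stereotypical role assignment)


def accuracy(examples):
    """Fraction of examples whose translated gender matches the reference."""
    if not examples:
        return 0.0
    return sum(e.gold == e.pred for e in examples) / len(examples)


def f1(examples, gender):
    """F1 score for one gender class, treating it as the positive label."""
    tp = sum(e.gold == gender and e.pred == gender for e in examples)
    fp = sum(e.gold != gender and e.pred == gender for e in examples)
    fn = sum(e.gold == gender and e.pred != gender for e in examples)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bias_metrics(examples):
    """Return accuracy, delta_g (F1 male - F1 female) and
    delta_s (accuracy pro-stereotypical - accuracy anti-stereotypical)."""
    pro = [e for e in examples if e.stereotype == "pro"]
    anti = [e for e in examples if e.stereotype == "anti"]
    return {
        "accuracy": accuracy(examples),
        "delta_g": f1(examples, "male") - f1(examples, "female"),
        "delta_s": accuracy(pro) - accuracy(anti),
    }
```

Under these assumed definitions, a positive `delta_g` would indicate better handling of male entities, and a positive `delta_s` would indicate that the system does better when the reference gender matches the occupational stereotype.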

Code & Models

Repositories

milanlproc/interpretability-mt-gender-bias (Official)

Taxonomy

Topics: Text Readability and Simplification · Hate Speech and Cyberbullying Detection