Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation

Bar Iluz; Tomasz Limisiewicz; Gabriel Stanovsky; David Mareček

arXiv:2309.12491·cs.CL·October 3, 2023

TL;DR

This paper investigates how subword tokenization and the distribution of gendered forms in the training data influence gender bias in machine translation. It finds that data imbalance is a major factor and that targeted fine-tuning can reduce bias.

Contribution

The paper shows how gender-form imbalance and subword tokenization affect bias, and proposes analysis methods and a fine-tuning approach to mitigate it.

Findings

01

Imbalance of gender forms in the training corpus is a major contributor to gender bias, with a greater impact than subword splitting.

02

Subword splits provide good estimates of gender-form imbalance in the training data, even when the corpus is not publicly available.

03

Fine-tuning token embeddings reduces the gender prediction gap.

Abstract

We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous works. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer's vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profession names (e.g., Spanish "doctora" for "female doctor") tend to be split into multiple subword tokens. Our results indicate that the imbalance of gender forms in the model's training corpus is a major factor contributing to gender bias and has a greater impact than subword splitting. We show that analyzing subword splits provides good estimates of gender-form imbalance in the training data and can be used even when the corpus is not publicly available. We also demonstrate that fine-tuning just…
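The subword-split analysis described in the abstract can be illustrated with a small sketch. This is not the authors' code: it uses greedy longest-match splitting as a simplified stand-in for a model's actual tokenizer, and `TOY_VOCAB`, `greedy_subword_split`, and `token_counts` are hypothetical names invented for this example. The vocabulary is made up to mimic the situation the abstract describes, where the more frequent gender form of a profession is a single token and the less frequent form is split.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy vocabulary (an assumption, not taken from any real model):
# the stereotypical forms "doctor" and "enfermera" are whole tokens, while
# "doctora" and "enfermero" must be assembled from smaller pieces.
TOY_VOCAB: Set[str] = {"doctor", "enfermera", "enfermer", "a", "o"}


def greedy_subword_split(word: str, vocab: Set[str]) -> List[str]:
    """Split a word by greedy longest-match against a subword vocabulary.

    Characters that match nothing in the vocabulary are emitted as
    single-character pieces so the function always terminates.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def token_counts(forms: Iterable[str], vocab: Set[str]) -> Dict[str, int]:
    """Map each word form to the number of subword tokens it is split into."""
    return {form: len(greedy_subword_split(form, vocab)) for form in forms}


if __name__ == "__main__":
    forms = ["doctor", "doctora", "enfermera", "enfermero"]
    for form, n in token_counts(forms, TOY_VOCAB).items():
        print(f"{form}: {n} token(s) {greedy_subword_split(form, TOY_VOCAB)}")
```

With a real system, the same comparison would be run with the translation model's own tokenizer in place of `greedy_subword_split`; a form that needs more tokens than its counterpart is a hint that it was rarer in the data the tokenizer was trained on.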

Code & Models

Repositories

tomlimi/MT-Tokenizer-Bias
Official

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Gender Studies in Language