Towards Massive Multilingual Holistic Bias

Xiaoqing Ellen Tan; Prangthip Hansanti; Carleigh Wood; Bokai Yu; Christophe Ropers; Marta R. Costa-jussà

arXiv:2407.00486·cs.CL·July 2, 2024

TL;DR

This paper introduces the first eight languages of the MASSIVE MULTILINGUAL HOLISTICBIAS (MMHB) dataset, approximately 6 million sentences covering 13 demographic axes, as a benchmark for evaluating and mitigating demographic biases in multilingual language models.

Contribution

It presents a scalable, multilingual dataset construction methodology and demonstrates its use in analyzing gender bias and toxicity in machine translation.

Findings

01

Gender bias: masculine sentences score about 4 chrF points higher than their feminine counterparts.

02

Models overgeneralize to masculine forms: scores are about 12 chrF points higher against masculine references than against feminine ones.

03

Added toxicity in translation outputs reaches up to 2.3%.
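
The chrF gaps reported above compare average scores between masculine and feminine evaluation sets. The sketch below shows one way such a gap could be computed. It uses a simplified character n-gram F-score (precision and recall averaged over n-gram orders 1 to 6, with beta = 2), which is an assumption made here for illustration and will not match the paper's scores or a reference implementation exactly. The function names and the toy sentence pairs are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale (illustrative, not a reference implementation)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


def mean_chrf(pairs):
    """Average simplified chrF over (hypothesis, reference) pairs."""
    return sum(simple_chrf(h, r) for h, r in pairs) / len(pairs)


def chrf_gap(masculine_pairs, feminine_pairs):
    """Positive values mean the masculine set scores higher on average."""
    return mean_chrf(masculine_pairs) - mean_chrf(feminine_pairs)


# Hypothetical toy data: the system output defaults to a masculine form.
masculine_pairs = [("he is a doctor", "he is a doctor")]
feminine_pairs = [("he is a doctor", "she is a doctor")]
gap = chrf_gap(masculine_pairs, feminine_pairs)
```

In this toy case the gap is positive because the output matches the masculine reference exactly but only partially matches the feminine one.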

Abstract

In the current landscape of automatic language generation, there is a need to understand, evaluate, and mitigate demographic biases as existing models are becoming increasingly multilingual. To address this, we present the initial eight languages from the MASSIVE MULTILINGUAL HOLISTICBIAS (MMHB) dataset and benchmark consisting of approximately 6 million sentences representing 13 demographic axes. We propose an automatic construction methodology to further scale up MMHB sentences in terms of both language coverage and size, leveraging limited human annotation. Our approach utilizes placeholders in multilingual sentence construction and employs a systematic method to independently translate sentence patterns, nouns, and descriptors. Combined with human translation, this technique carefully designs placeholders to dynamically generate multiple sentence variations and significantly reduces…
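
The placeholder-based construction described in the abstract can be illustrated with a minimal sketch: sentence patterns contain placeholders, and descriptors and nouns are substituted independently to generate every combination. The patterns, descriptors, nouns, and the `expand` function below are hypothetical toy examples, not taken from MMHB, and the sketch ignores the grammatical agreement that a gendered language would require.

```python
from itertools import product

# Hypothetical toy inputs, each list maintained independently of the others.
PATTERNS = [
    "I have friends who are {descriptor} {noun}.",
    "What do you think about {descriptor} {noun}?",
]
DESCRIPTORS = {"age": ["young", "elderly"], "ability": ["deaf"]}
NOUNS = {"feminine": ["women", "girls"], "masculine": ["men", "boys"]}


def expand(patterns, descriptors, nouns):
    """Fill every pattern with every descriptor/noun pair, keeping axis and gender labels."""
    rows = []
    for pattern, (axis, descs), (gender, noun_list) in product(
        patterns, descriptors.items(), nouns.items()
    ):
        for desc, noun in product(descs, noun_list):
            rows.append(
                {
                    "axis": axis,
                    "gender": gender,
                    "sentence": pattern.format(descriptor=desc, noun=noun),
                }
            )
    return rows


sentences = expand(PATTERNS, DESCRIPTORS, NOUNS)
```

Because the three lists are combined multiplicatively, 2 patterns, 3 descriptors, and 4 nouns already yield 24 sentences, which suggests how a small amount of translated material can scale to a large dataset.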

Taxonomy

Topics: Linguistics and Terminology Studies · Interpreting and Communication in Healthcare · Translation Studies and Practices