Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Shivalika Singh; Angelika Romanou; Clémentine Fourrier; David I. Adelani; Jian Gang Ngui; Daniel Vila-Suero; Peerat Limkonchotiwat; Kelly Marchisio; Wei Qi Leong; Yosephine Susanto; Raymond Ng; Shayne Longpre; Wei-Yin Ko; Sebastian Ruder; Madeline Smith; Antoine Bosselut; Alice Oh; Andre F. T. Martins; Leshem Choshen; Daphne Ippolito; Enzo Ferrante; Marzieh Fadaee; Beyza Ermis; Sara Hooker

arXiv:2412.03304·cs.CL·February 20, 2025·2 cites

Open Access 5 Datasets 1 Video

TL;DR

This paper investigates cultural and linguistic biases in multilingual datasets like MMLU, highlighting their impact on model evaluation and introducing Global MMLU, a more culturally aware benchmark across 42 languages.

Contribution

The paper identifies cultural and linguistic biases in existing multilingual benchmarks and presents Global MMLU, an improved, culturally sensitive evaluation dataset with broader language coverage and bias annotations.

Findings

01. 28% of MMLU questions require culturally sensitive knowledge to answer.

02. 84.9% of questions that require geographic knowledge focus on North America or Europe.

03. Model rankings change significantly when evaluation is restricted to the culturally sensitive subset of questions.

Abstract

Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. These biases stem not only from differences in language but also from the cultural knowledge required to interpret questions, reducing the practical utility of translated datasets like MMLU. Furthermore, translation often introduces artefacts that can distort the meaning or clarity of questions in the target language. A common practice in multilingual evaluation is to rely on machine-translated evaluation sets, but simply translating a dataset is insufficient to address these challenges. In this work, we trace the impact of both of these issues on multilingual evaluations and ensuing model performances. Our large-scale evaluation of state-of-the-art open and proprietary models illustrates that progress on MMLU depends heavily on learning Western-centric concepts, with 28%…

Taxonomy

Topics: Second Language Learning and Teaching

Methods: Sparse Evolutionary Training · Focus