Test Set Quality in Multilingual LLM Evaluation

Chalamalasetti Kranti; Gabriel Bernier-Colborne; Yvan Gauthier; Sowmya Vajjala

arXiv:2508.02635·cs.CL·November 14, 2025

Open Access

TL;DR

This paper critically examines the quality of multilingual benchmark datasets for LLM evaluation, revealing significant errors and performance discrepancies, and advocates for ongoing dataset revision and quality assurance.

Contribution

The paper provides a manual error analysis of multilingual test sets in French and Telugu, shows why dataset quality matters, and suggests revisions for more reliable evaluation.

Findings

01. Large performance differences when using original vs. revised datasets

02. Identification of numerous errors in semi-automatically created datasets

03. Recommendations for dataset creators and users to improve test set quality

Abstract

Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state of the art in the multilingual capabilities of Large Language Models. However, little attention has been paid to the quality of the datasets themselves, despite previous work identifying errors even in fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages, French and Telugu, identifying several errors in the process. We compare the performance of several LLMs on the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification