Test Set Quality in Multilingual LLM Evaluation
Chalamalasetti Kranti, Gabriel Bernier-Colborne, Yvan Gauthier, Sowmya Vajjala

TL;DR
This paper critically examines the quality of multilingual benchmark datasets for LLM evaluation, revealing significant errors and performance discrepancies, and advocates for ongoing dataset revision and quality assurance.
Contribution
It provides a manual error analysis of multilingual test sets in French and Telugu, highlighting the importance of dataset quality and suggesting revisions for more reliable evaluation.
Findings
Large performance differences when using original vs. revised datasets
Identification of numerous errors in semi-automatically created datasets
Recommendations for dataset creators and users to improve test set quality
Abstract
Several multilingual benchmark datasets have recently been developed in a semi-automatic manner to measure progress and understand the state of the art in the multilingual capabilities of Large Language Models. However, little attention has been paid to the quality of the datasets themselves, even though previous work has identified errors in fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages, French and Telugu, identifying several errors in the process. We compare the performance of several LLMs on the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and…
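The original-versus-revised comparison described in the abstract can be illustrated with a minimal sketch. It assumes the simplest case, where a revision only corrects gold labels, so the same model predictions can be rescored against both versions. The function names and the toy data are hypothetical and are not taken from the paper, whose revisions may also alter or remove items.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def accuracy_shift(
    predictions: Sequence[str],
    original_gold: Sequence[str],
    revised_gold: Sequence[str],
) -> Dict[str, float]:
    """Score one set of predictions against both versions of the gold labels.

    Assumption: the revision changes labels only, so items stay aligned
    by position across the two versions.
    """
    original = accuracy(predictions, original_gold)
    revised = accuracy(predictions, revised_gold)
    return {
        "original": original,
        "revised": revised,
        "difference": revised - original,
    }


# Hypothetical toy data: five multiple-choice items, one gold label corrected.
predictions = ["B", "C", "A", "D", "A"]
original_gold = ["B", "C", "B", "D", "C"]  # third label is wrong in this toy set
revised_gold = ["B", "C", "A", "D", "C"]   # third label corrected after review

result = accuracy_shift(predictions, original_gold, revised_gold)
print(result)
```

In this toy case a single corrected label moves accuracy from 3 of 5 to 4 of 5, which shows how a small number of test set errors can shift reported scores noticeably on small benchmarks.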
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
