On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations?
Rochelle Choenni, Sara Rajaee, Christof Monz, Ekaterina Shutova

TL;DR
This paper critically examines evaluation practices in multilingual NLP, highlights their limitations, and tests whether machine-translated test data can serve as a scalable alternative to human translations for assessing models across many languages, especially low-resource ones.
Contribution
It analyzes current evaluation frameworks, discusses their shortcomings, and empirically demonstrates the potential and limitations of using machine translation for large-scale multilingual model assessment.
Findings
High-resource language subsets are representative of broader language groups.
Evaluation often overestimates MLM performance on low-resource languages.
Simple baselines can perform well without extensive multilingual pretraining.
Abstract
While multilingual language models (MLMs) have been trained on 100+ languages, they are typically only evaluated across a handful of them due to a lack of available test data in most languages. This is particularly problematic when assessing MLMs' potential for low-resource and unseen languages. In this paper, we present an analysis of existing evaluation frameworks in multilingual NLP, discuss their limitations, and propose several directions for more robust and reliable evaluation practices. Furthermore, we empirically study to what extent machine translation offers a reliable alternative to human translation for large-scale evaluation of MLMs across a wide set of languages. We use a SOTA translation model to translate test data from 4 tasks to 198 languages and use them to evaluate three MLMs. We show that while the selected subsets of high-resource test languages are generally…
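One way to picture the comparison the abstract describes is to score the same model on human-translated and machine-translated versions of a test set and check how closely the per-language results agree. The sketch below is only an illustration under assumptions: the function names, the dictionary layout, and the choice of Pearson correlation plus mean gap as agreement measures are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def compare_test_sets(human_scores, mt_scores):
    """Compare per-language scores on human- vs machine-translated test data.

    Both arguments map a language label to a task score (e.g. accuracy).
    Only languages present in both mappings are compared.
    """
    shared = sorted(set(human_scores) & set(mt_scores))
    if not shared:
        raise ValueError("no languages in common")
    # Positive gap: the model scores higher on the machine-translated data.
    gaps = {lang: mt_scores[lang] - human_scores[lang] for lang in shared}
    return {
        "languages": shared,
        "gaps": gaps,
        "mean_gap": sum(gaps.values()) / len(shared),
        "pearson": pearson(
            [human_scores[lang] for lang in shared],
            [mt_scores[lang] for lang in shared],
        ),
    }
```

With scores from a real evaluation run, a high correlation and a small mean gap would suggest that the machine-translated data ranks languages similarly to the human-translated data; neither number on its own establishes that the translations are reliable.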
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
