XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
Sebastian Ruder, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu, Junjie Hu, Dan Garrette, Graham Neubig, Melvin Johnson

TL;DR
XTREME-R advances multilingual NLP evaluation by introducing more challenging tasks, a broader language set, and diagnostic tools to better understand model capabilities and limitations.
Contribution
The paper extends the XTREME benchmark to XTREME-R, adding difficult tasks, more languages, and diagnostic tools for comprehensive multilingual model evaluation.
Findings
State-of-the-art performance on the XTREME benchmark has improved by more than 13 points over the past year.
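Cross-lingual transfer remains uneven: improvements have been easier to achieve in some tasks than in others, and a sizeable gap to human-level performance remains.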
The MultiCheckList diagnostic suite and fine-grained multi-dataset evaluation give a more detailed view of model strengths and weaknesses.
Abstract
Machine learning has brought striking advances in multilingual natural language processing capabilities over the past year. For example, the latest techniques have improved the state-of-the-art performance on the XTREME multilingual benchmark by more than 13 points. While a sizeable gap to human-level performance remains, improvements have been easier to achieve in some tasks than in others. This paper analyzes the current state of cross-lingual transfer learning and summarizes some lessons learned. In order to catalyze meaningful progress, we extend XTREME to XTREME-R, which consists of an improved set of ten natural language understanding tasks, including challenging language-agnostic retrieval tasks, and covers 50 typologically diverse languages. In addition, we provide a massively multilingual diagnostic suite (MultiCheckList) and fine-grained multi-dataset evaluation capabilities…
