A Principled Framework for Evaluating on Typologically Diverse Languages
Esther Ploeger, Wessel Poelman, Andreas Holck Høeg-Petersen, Anders Schlichtkrull, Miryam de Lhoneux, Johannes Bjerva

TL;DR
This paper introduces a systematic framework for selecting typologically diverse languages for multilingual NLP evaluation and shows that the choice of sampling method affects how generalizable models appear to be across languages.
Contribution
The authors propose a new language sampling framework based on typological features, improving upon previous methods for selecting diverse languages in NLP evaluation.
Findings
Systematic sampling methods retrieve more typologically diverse languages.
Diverse language sampling enhances the evaluation of multilingual models.
Sampling impacts the perceived generalizability of NLP systems.
Abstract
Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world's languages is practically infeasible. To attain generalizability, representative language sampling is essential. Previous work argues that generalizable multilingual evaluation sets should contain languages with diverse typological properties. However, 'typologically diverse' language samples have been found to vary considerably in this regard, and popular sampling methods are flawed and inconsistent. We present a language sampling framework for selecting highly typologically diverse languages given a sampling frame, informed by language typology. We compare sampling methods with a range of metrics and find that our systematic methods consistently retrieve more…
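To make the idea of diversity-driven sampling from a sampling frame concrete, the sketch below greedily picks languages so that each new pick is as far as possible from those already chosen. This is an illustrative farthest-point heuristic under stated assumptions, not necessarily the authors' method: the toy feature vectors, the Hamming distance, and the names `FRAME`, `hamming`, and `greedy_maxmin_sample` are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy sampling frame: language -> binary typological features.
# The values are invented for illustration and come from no real database.
FRAME = {
    "lang_a": (1, 0, 0, 1),
    "lang_b": (1, 0, 0, 0),
    "lang_c": (0, 1, 1, 0),
    "lang_d": (0, 1, 1, 1),
    "lang_e": (1, 1, 0, 0),
}


def hamming(u, v):
    """Fraction of feature positions on which two languages differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def greedy_maxmin_sample(frame, k):
    """Pick k languages; each new pick maximizes its minimum distance
    to the languages already selected (ties broken by name order)."""
    if not 1 <= k <= len(frame):
        raise ValueError("k must be between 1 and the size of the frame")
    names = sorted(frame)
    if k == 1:
        return [names[0]]
    # Seed with the most distant pair in the frame.
    first, second = max(
        combinations(names, 2),
        key=lambda pair: hamming(frame[pair[0]], frame[pair[1]]),
    )
    selected = [first, second]
    while len(selected) < k:
        remaining = [n for n in names if n not in selected]
        best = max(
            remaining,
            key=lambda n: min(hamming(frame[n], frame[s]) for s in selected),
        )
        selected.append(best)
    return selected


print(greedy_maxmin_sample(FRAME, 3))  # ['lang_a', 'lang_c', 'lang_e']
```

In this toy frame, `lang_a` and `lang_c` differ on every feature, so they seed the sample; `lang_e` is added next because it is equally far from both. A real application would replace `FRAME` with features from a typological database and would need a policy for missing values.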
Taxonomy
Topics: Linguistics, Language Diversity, and Identity
