CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, Damian Stewart, Malte Ostendorff, Idris Abdulmumin, Vukosi Marivate, Shamsuddeen Hassan Muhammad, Atnafu Lambebo Tonja, Hend Al-Khalifa, Nadia Ghezaiel Hammouda

TL;DR
CommonLID is a new, human-annotated benchmark for language identification on web data, covering 109 languages. It shows that existing evaluations often overestimate model accuracy, especially for under-served languages.
Contribution
We introduce CommonLID, a comprehensive, community-driven benchmark for language identification in the web domain, covering 109 languages and addressing gaps in existing evaluation datasets.
Findings
Existing evaluations overestimate LID accuracy for many web languages.
CommonLID reveals underperformance of popular LID models on web data.
Many languages remain under-served in current LID benchmarks.
Abstract
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID…
Taxonomy
Topics: Authorship Attribution and Profiling · Text Readability and Simplification · Hate Speech and Cyberbullying Detection
