Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Isaac Caswell; Theresa Breiner; Daan van Esch; Ankur Bapna

arXiv:2010.14571·cs.CL·October 30, 2020

TL;DR

This paper investigates the challenges of language identification in web text corpora, revealing significant accuracy issues for low-resource languages and proposing techniques to improve dataset quality for a large-scale multilingual web corpus.

Contribution

The paper identifies key error modes in existing LangID models and introduces wordlist-based filters and semi-supervised transformer models that substantially improve language detection accuracy.
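To make the wordlist-filter idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a filter that keeps a sentence only when a tunable fraction of its whitespace-separated tokens appears in a curated wordlist for the target language. The function name `wordlist_precision_filter`, the `min_fraction` parameter, and the toy tokens are all hypothetical and chosen for illustration.

```python
from typing import Iterable, List, Set


def wordlist_precision_filter(
    sentences: Iterable[str],
    wordlist: Set[str],
    min_fraction: float = 0.2,
) -> List[str]:
    """Keep sentences whose share of in-wordlist tokens is at least min_fraction.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace,
    which is a simplification and will not suit every script.
    """
    kept = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if not tokens:
            # Empty or whitespace-only lines carry no evidence; drop them.
            continue
        hits = sum(1 for tok in tokens if tok in wordlist)
        if hits / len(tokens) >= min_fraction:
            kept.append(sentence)
    return kept


# Toy placeholder data, not real words from any released wordlist.
toy_wordlist = {"foo", "bar"}
toy_sentences = ["foo bar qux", "qux quux corge", ""]
kept_loose = wordlist_precision_filter(toy_sentences, toy_wordlist, min_fraction=0.5)
kept_strict = wordlist_precision_filter(toy_sentences, toy_wordlist, min_fraction=0.7)
```

Raising `min_fraction` trades recall for precision: the stricter threshold rejects more text that was mislabeled, at the cost of discarding some genuine sentences.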

Findings

01

Human-judged LangID accuracy for web-crawl corpora built with these models is only around 5% for many lower-resource languages.

02

Proposed techniques increase median dataset precision from 5.5% to 71.2%.

03

The techniques enabled the creation of a large, relatively clean multilingual web text dataset.

Abstract

Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors:…
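The gap between strong held-out scores and low precision on web text is partly what class imbalance predicts. The sketch below works through the base-rate arithmetic. All numbers in it (prevalence, recall, false-positive rate) are illustrative assumptions rather than figures from the paper, and `expected_precision` is a hypothetical helper name.

```python
def expected_precision(prevalence: float, recall: float, false_positive_rate: float) -> float:
    """Expected precision of a one-vs-rest classifier on data with a given base rate.

    prevalence: fraction of the crawl that is truly in the target language.
    recall: fraction of true target-language text the model labels correctly.
    false_positive_rate: fraction of all other text mislabeled as the target.
    """
    true_positives = prevalence * recall
    false_positives = (1.0 - prevalence) * false_positive_rate
    total_predicted = true_positives + false_positives
    if total_predicted == 0.0:
        return 0.0
    return true_positives / total_predicted


# Assumed scenario: the same classifier on a balanced test set versus a crawl
# where the target language is one sentence in ten thousand.
balanced = expected_precision(prevalence=0.5, recall=0.95, false_positive_rate=0.002)
web_crawl = expected_precision(prevalence=0.0001, recall=0.95, false_positive_rate=0.002)
```

Under these assumed rates, the classifier looks nearly perfect on the balanced set, while on the crawl the false positives from the overwhelmingly larger pool of other text outnumber the true positives many times over.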

Code & Models

Repositories

google-research-datasets/TF-IDF-IIF-top100-wordlists
