How Should We Model the Probability of a Language?
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

TL;DR
This paper critiques current language identification systems for their limited coverage and argues for a paradigm shift towards contextual, cue-based modeling to better identify underrepresented languages.
Contribution
It proposes rethinking language identification as a routing problem that incorporates environmental cues, moving beyond fixed-prior models.
Findings
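Commercial LID systems reliably identify only a few hundred of the more than 7,000 languages spoken in the world, and only in written form.
The paper argues that reframing LID as a routing problem is needed to improve coverage for tail languages.
Environmental cues can indicate which languages are locally plausible, information that global, fixed-prior models do not use.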
Abstract
Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
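The role of the prior can be illustrated with a small worked example. The sketch below is not the authors' method; it only applies Bayes' rule to two candidate languages to show how the same text evidence yields different decisions under a global prior and under a context-specific one. The likelihoods, both priors, and the function name `posterior` are hypothetical values chosen for illustration.

```python
def posterior(likelihoods, prior):
    """Apply Bayes' rule: P(lang | text) is proportional to P(text | lang) * P(lang)."""
    unnormalized = {
        lang: likelihoods[lang] * prior.get(lang, 0.0) for lang in likelihoods
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("prior assigns zero mass to every candidate language")
    return {lang: p / total for lang, p in unnormalized.items()}


# Hypothetical likelihoods P(text | lang) for one short, ambiguous text.
# The text model slightly favors the tail language ("hat") over "fra".
likelihoods = {"fra": 0.020, "hat": 0.030}

# Hypothetical global prior: the tail language is rare across the whole web.
global_prior = {"fra": 0.99, "hat": 0.01}

# Hypothetical local prior: environmental cues make the tail language plausible.
local_prior = {"fra": 0.30, "hat": 0.70}

global_post = posterior(likelihoods, global_prior)
local_post = posterior(likelihoods, local_prior)

print(max(global_post, key=global_post.get))  # most probable language, global prior
print(max(local_post, key=local_post.get))    # most probable language, local prior
```

With these illustrative numbers the global prior overrides the text evidence and selects the high-resource language, while the local prior lets the tail language win. The classifier is identical in both cases; only the estimate of which languages are plausible in context changes.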
Taxonomy
Topics: Authorship Attribution and Profiling · Language and cultural evolution · Natural Language Processing Techniques
