How Should We Model the Probability of a Language?

Rasul Dent; Pedro Ortiz Suarez; Thibault Clérice; Benoît Sagot

arXiv:2602.08951 · cs.CL · February 10, 2026

TL;DR

This paper critiques current language identification systems for their limited coverage and argues for a paradigm shift towards contextual, cue-based modeling to better identify underrepresented languages.

Contribution

It proposes rethinking language identification as a routing problem that incorporates environmental cues, moving beyond fixed-prior models.

Findings

1. Current systems cover only a few hundred languages reliably.

2. Reframing LID as a routing problem can improve tail language coverage.

3. Incorporating environmental cues makes language identification more contextually plausible.

Abstract

Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
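
To make the role of the prior concrete, the sketch below applies Bayes' rule to a two-language decision. It is only an illustration of the general idea, not the authors' method: the `posterior` helper, the log-likelihood values, and both prior tables are hypothetical numbers chosen for the example. It shows how the same text-based evidence can yield different decisions under a fixed global prior and under a prior adjusted by a local cue.

```python
import math

def posterior(log_likelihoods, prior):
    """Combine per-language log-likelihoods with a prior via Bayes' rule."""
    # Unnormalized log-posterior: log P(text | lang) + log P(lang)
    scores = {lang: ll + math.log(prior[lang])
              for lang, ll in log_likelihoods.items()}
    # Normalize with log-sum-exp for numerical stability
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {lang: math.exp(s - log_z) for lang, s in scores.items()}

# Hypothetical evidence: the text model slightly prefers Haitian Creole (hat)
# over French (fra). These values are invented for illustration.
log_likelihoods = {"fra": -10.0, "hat": -9.0}

# Assumed global, fixed prior: the tail language is rare in web-scale data.
global_prior = {"fra": 0.99, "hat": 0.01}

# Assumed local prior: an environmental cue (for example, the source site)
# makes the tail language locally plausible.
local_prior = {"fra": 0.30, "hat": 0.70}

global_post = posterior(log_likelihoods, global_prior)
local_post = posterior(log_likelihoods, local_prior)

print(max(global_post, key=global_post.get))  # decision under the global prior
print(max(local_post, key=local_post.get))    # decision under the local prior
```

With these assumed numbers, the global prior overrides the text evidence in favor of the high-resource language, while the context-adjusted prior lets the tail language win. The likelihood model is identical in both cases; only the prior changes.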

Taxonomy

Topics: Authorship Attribution and Profiling · Language and cultural evolution · Natural Language Processing Techniques