What Language is This? Ask Your Tokenizer
Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel

TL;DR
UniLID is a novel language identification method based on UnigramLM tokenization that is efficient, adaptable, and highly effective, especially in low-resource and dialect identification scenarios.
Contribution
The paper introduces UniLID, a new LID approach built on UnigramLM tokenization. It supports incremental learning and integration into existing pipelines, and it shows superior performance in low-resource settings.
Findings
Achieves over 70% accuracy with only five samples per language
Outperforms baselines like fastText, GlotLID, and CLD3 on benchmarks
Significantly improves dialect identification accuracy
Abstract
Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into…
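The core idea in the abstract (one unigram distribution per language over a shared vocabulary, with segmentation chosen separately under each language) can be illustrated with a small sketch. This is not the authors' implementation: the toy vocabulary, the weights, the helper names (`viterbi_log_prob`, `identify`, `make_model`), and the choice to score a text by its single best (Viterbi) segmentation are all assumptions made for illustration.

```python
import math

def viterbi_log_prob(text, log_probs, max_len=8):
    """Log-probability of the best segmentation of `text` under one
    language's unigram distribution (hypothetical helper)."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] > -math.inf:
                cand = best[start] + log_probs[piece]
                if cand > best[end]:
                    best[end] = cand
    return best[n]  # -inf if no segmentation covers the text

def identify(text, models):
    """Score `text` under every language model and return the argmax."""
    scores = {lang: viterbi_log_prob(text, lp) for lang, lp in models.items()}
    return max(scores, key=scores.get), scores

# Toy shared vocabulary (assumption; a real one comes from a trained tokenizer).
VOCAB = ["t", "h", "e", "d", "i", "th", "he", "the", "di", "ie", "die"]

def make_model(weights):
    """Turn unnormalised token weights into log-probabilities over VOCAB.
    Tokens not listed get weight 1.0 so every token stays reachable."""
    total = sum(weights.get(tok, 1.0) for tok in VOCAB)
    return {tok: math.log(weights.get(tok, 1.0) / total) for tok in VOCAB}

# Made-up language-conditional distributions over the same vocabulary.
MODELS = {
    "en": make_model({"the": 20.0, "th": 5.0, "he": 5.0}),
    "de": make_model({"die": 20.0, "ie": 5.0, "di": 5.0}),
}

if __name__ == "__main__":
    for sample in ["the", "die"]:
        lang, scores = identify(sample, MODELS)
        print(sample, "->", lang)
```

In this sketch, adding a language amounts to adding one more entry to `MODELS`; the existing entries are untouched, which mirrors the incremental property described above. Text containing characters outside the toy vocabulary scores negative infinity under every model, so a practical system would need some fallback for unknown symbols.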
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Linguistic Variation and Morphology
