Geographically-Informed Language Identification
Jonathan Dunn, Lane Edwards-Brown

TL;DR
This paper introduces a geographically-informed approach to language identification, creating region-specific models that improve accuracy by leveraging geographic context, especially in social media data.
Contribution
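The paper develops 16 region-specific language identification models. Each model covers the languages expected to appear in that region's countries plus 31 widely spoken international languages, and the upstream evaluation reports f-score improvements of 1.7 to 10.4 points.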
Findings
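F-score improvement in the upstream evaluation ranging from 1.7 points (Southeast Asia) to 10.4 points (North Africa)
Significant impact on the language labels applied to social media data in the downstream evaluation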
Coverage of 916 languages with 50-character samples
Abstract
This paper develops an approach to language identification in which the set of languages considered by the model depends on the geographic origin of the text in question. Given that many digital corpora can be geo-referenced at the country level, this paper formulates 16 region-specific models, each of which contains the languages expected to appear in countries within that region. These regional models also each include 31 widely-spoken international languages in order to ensure coverage of these linguae francae regardless of location. An upstream evaluation using traditional language identification testing data shows an improvement in f-score ranging from 1.7 points (Southeast Asia) to as much as 10.4 points (North Africa). A downstream evaluation on social media data shows that this improved performance has a significant impact on the language labels which are applied to large…
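The core idea in the abstract, that the candidate set of languages depends on where a text comes from, can be illustrated with a minimal sketch. The paper formulates separate region-specific models; the code below does not reproduce that and instead approximates the effect by masking the scores of a single hypothetical classifier. The language inventories, the country-to-region mapping, and the function names are all illustrative assumptions rather than the paper's actual lists or API.

```python
from typing import Dict, Set

# Illustrative inventories only (assumption): the paper's real models use
# 31 international languages and 16 regions, which are not listed here.
INTERNATIONAL: Set[str] = {"eng", "fra", "spa", "ara"}
REGIONAL: Dict[str, Set[str]] = {
    "north_africa": {"kab", "arz", "tzm"},
    "southeast_asia": {"ind", "tgl", "vie"},
}
# Hypothetical country-level geo-reference to region mapping.
COUNTRY_TO_REGION: Dict[str, str] = {
    "MA": "north_africa",
    "DZ": "north_africa",
    "ID": "southeast_asia",
}


def candidate_languages(country: str) -> Set[str]:
    """Return the regional languages plus the international languages."""
    region = COUNTRY_TO_REGION[country]
    return REGIONAL[region] | INTERNATIONAL


def identify(scores: Dict[str, float], country: str) -> str:
    """Pick the highest-scoring language among those allowed for the country.

    `scores` stands in for the output of some language identification
    classifier (assumption); languages outside the region's candidate set
    are ignored.
    """
    allowed = candidate_languages(country)
    restricted = {lang: s for lang, s in scores.items() if lang in allowed}
    if not restricted:
        raise ValueError("no candidate language received a score")
    return max(restricted, key=restricted.get)


# Example: the same scores yield different labels depending on origin.
scores = {"ind": 0.5, "kab": 0.3, "fra": 0.2}
label_morocco = identify(scores, "MA")
label_indonesia = identify(scores, "ID")
```

In this sketch, a text geo-referenced to Morocco can never be labeled as Indonesian, even when an unrestricted classifier would score Indonesian highest, while the international languages remain available everywhere.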
Taxonomy
Topics: Authorship Attribution and Profiling · Names, Identity, and Discrimination Research · Linguistic Variation and Morphology
Methods: Sparse Evolutionary Training
