Geographically-Informed Language Identification

Jonathan Dunn; Lane Edwards-Brown

arXiv:2403.09892 · cs.CL · March 18, 2024 · 1 citation

TL;DR

This paper introduces a geographically-informed approach to language identification: region-specific models restrict the candidate languages to those expected in a text's region of origin, which improves accuracy and changes the language labels assigned to social media data.

Contribution

The paper develops 16 region-specific language identification models, each covering the languages expected in that region's countries plus 31 widely-spoken international languages, and reports f-score gains of 1.7 to 10.4 points in its upstream evaluation.

Findings

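01 F-score improves by 1.7 points (Southeast Asia) to as much as 10.4 points (North Africa) in the upstream evaluation

02 The improved performance significantly affects the language labels applied to social media data in the downstream evaluation

03 Coverage of 916 languages with 50-character samples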

Abstract

This paper develops an approach to language identification in which the set of languages considered by the model depends on the geographic origin of the text in question. Given that many digital corpora can be geo-referenced at the country level, this paper formulates 16 region-specific models, each of which contains the languages expected to appear in countries within that region. These regional models also each include 31 widely-spoken international languages in order to ensure coverage of these linguae francae regardless of location. An upstream evaluation using traditional language identification testing data shows an improvement in f-score ranging from 1.7 points (Southeast Asia) to as much as 10.4 points (North Africa). A downstream evaluation on social media data shows that this improved performance has a significant impact on the language labels which are applied to large…
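The core idea in the abstract, that the set of candidate languages depends on where a text comes from, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the language inventories, the country-to-region table, the function names, and the toy scores are hypothetical and are not taken from the paper or from the geolid repository. The paper trains separate region-specific models; this sketch instead filters the scores of a single generic classifier, which is a simplification rather than the authors' method.

```python
from typing import Dict, Optional, Set

# Hypothetical inventories for illustration only. The paper uses 16 regions
# and 31 international languages; neither list is reproduced here.
INTERNATIONAL: Set[str] = {"eng", "fra", "spa", "ara"}
REGION_LANGUAGES: Dict[str, Set[str]] = {
    "north_africa": {"kab", "arz"},
    "southeast_asia": {"ind", "tgl", "vie"},
}
COUNTRY_TO_REGION: Dict[str, str] = {
    "DZ": "north_africa",
    "EG": "north_africa",
    "ID": "southeast_asia",
    "VN": "southeast_asia",
}


def candidate_languages(country_code: str) -> Set[str]:
    """Return the languages considered for a text from this country.

    Local languages of the country's region are combined with the
    international languages, which are always included.
    """
    region: Optional[str] = COUNTRY_TO_REGION.get(country_code)
    local = REGION_LANGUAGES.get(region, set()) if region else set()
    return local | INTERNATIONAL


def identify(scores: Dict[str, float], country_code: str) -> str:
    """Pick the best-scoring language among those allowed for the country.

    `scores` stands in for the output of any language classifier
    (language code -> score); it is an assumed input, not a real API.
    """
    allowed = candidate_languages(country_code)
    restricted = {lang: s for lang, s in scores.items() if lang in allowed}
    if not restricted:
        raise ValueError("no candidate language received a score")
    return max(restricted, key=restricted.get)


# Toy scores for one short text: the unrestricted winner would be "ind".
toy_scores = {"ind": 0.40, "kab": 0.35, "eng": 0.25}
label_dz = identify(toy_scores, "DZ")  # Indonesian is excluded for Algeria
label_id = identify(toy_scores, "ID")  # Indonesian is allowed for Indonesia
label_unknown = identify(toy_scores, "XX")  # falls back to international set
```

With the same scores, the label changes with the country of origin: a language that is implausible for the region is never considered, while the international languages remain available everywhere.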

Code & Models

Repositories

jonathandunn/geolid (Official)

Taxonomy

Topics: Authorship Attribution and Profiling · Names, Identity, and Discrimination Research · Linguistic Variation and Morphology

Methods: Sparse Evolutionary Training