Language Identification with a Reciprocal Rank Classifier

Dominic Widdows; Chris Brew

arXiv:2109.09862 · cs.CL · September 22, 2021

Open Access

TL;DR

The paper introduces the Reciprocal Rank Classifier (RRC), a lightweight language identification method that is robust to domain changes and requires minimal training data. It outperforms some established systems, especially on short texts.

Contribution

The paper proposes the RRC, a novel language classifier based on reciprocal rank features, demonstrating improved robustness and domain adaptation over existing methods.

Findings

01 RRC maintains high accuracy when applied across domains.

02 RRC outperforms fastText and langid on short texts and Twitter data.

03 Adding domain-specific words improves RRC accuracy in conversational settings.

Abstract

Language identification is a critical component of language processing pipelines (Jauhiainen et al., 2019) and is not a solved problem in real-world settings. We present a lightweight and effective language identifier that is robust to changes of domain and to the absence of copious training data. The key idea for classification is that the reciprocal of the rank in a frequency table makes an effective additive feature score, hence the term Reciprocal Rank Classifier (RRC). The key finding for language classification is that ranked lists of words and frequencies of characters form a sufficient and robust representation of the regularities of key languages and their orthographies. We test this on two 22-language data sets and demonstrate zero-effort domain adaptation from a Wikipedia training set to a Twitter test set. When trained on Wikipedia but applied to Twitter the…
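
The reciprocal-rank idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it covers word ranks only (the abstract also mentions character frequencies), and the toy corpora, the tie-breaking rule, and the names build_ranks, rr_score, classify and RANK_TABLES are all assumptions made for illustration.

```python
from collections import Counter

def build_ranks(tokens):
    """Map each word to its 1-based rank in a frequency table (most frequent = 1)."""
    counts = Counter(tokens)
    # Assumption: ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def rr_score(tokens, ranks):
    """Additive score: sum of 1/rank over tokens found in the ranked list."""
    return sum(1.0 / ranks[t] for t in tokens if t in ranks)

def classify(text, rank_tables):
    """Return (best_language, scores); best_language is None if nothing matched."""
    tokens = text.lower().split()
    scores = {lang: rr_score(tokens, ranks) for lang, ranks in rank_tables.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores

# Hypothetical toy corpora; a real system would rank words from a large corpus.
TOY_CORPORA = {
    "en": "the cat and the dog and the bird of the house".split(),
    "es": "el gato y el perro y el pájaro de la casa de el".split(),
}
RANK_TABLES = {lang: build_ranks(toks) for lang, toks in TOY_CORPORA.items()}

label_en, scores_en = classify("the dog of the house", RANK_TABLES)
label_es, scores_es = classify("el perro de la casa", RANK_TABLES)
print(label_en, label_es)
```

Because each word contributes 1/rank, a handful of very frequent words dominates the score, which is one plausible reason such a scheme could work on short texts. Supporting a new domain in this sketch would amount to adding entries to a language's ranked list.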

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Text Readability and Simplification

Methods: Test · Support Vector Machine · fastText