Script-Agnostic Language Identification
Milind Agarwal, Joshua Otten, Antonios Anastasopoulos

TL;DR
This paper introduces a language identification method that stays robust across scripts by learning script-agnostic representations, which is especially useful for languages written in multiple scripts, such as those of the Indian Subcontinent.
Contribution
The paper proposes novel experimental strategies to learn script-agnostic representations, improving language identification for multilingual and multi-script languages.
Findings
Word-level script randomization enhances script-agnostic identification.
Exposure to multiple scripts improves downstream performance.
The method maintains competitive accuracy on natural text.
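To make the first finding concrete, here is a minimal sketch of word-level script randomization: each whitespace-separated word is rewritten in a script picked at random from Tamil, Telugu, Kannada, and Malayalam. The conversion is an assumption made for illustration, not the authors' pipeline. It shifts code points between the parallel Unicode blocks of the four scripts, which is crude (some shifted code points are unassigned in the target script, notably in Tamil). The helper names script_of, to_script, and randomize_scripts are hypothetical.

```python
import random

# Start code points of the Unicode blocks for the four scripts.
# The blocks share a parallel layout, so a fixed offset gives a rough
# character-for-character conversion (an approximation, see the lead-in).
BLOCK_START = {
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def script_of(ch):
    """Return the name of the script block containing ch, or None."""
    cp = ord(ch)
    for name, start in BLOCK_START.items():
        if start <= cp < start + BLOCK_SIZE:
            return name
    return None


def to_script(word, target):
    """Shift every character of word into the target script's block.

    Characters outside the four blocks (Latin, digits, punctuation)
    are left unchanged.
    """
    out = []
    for ch in word:
        src = script_of(ch)
        if src is None:
            out.append(ch)
        else:
            out.append(chr(ord(ch) - BLOCK_START[src] + BLOCK_START[target]))
    return "".join(out)


def randomize_scripts(sentence, rng):
    """Rewrite each whitespace-separated word in a randomly chosen script."""
    scripts = list(BLOCK_START)
    return " ".join(to_script(w, rng.choice(scripts)) for w in sentence.split())


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that runs are repeatable
    print(randomize_scripts("తెలుగు భాష", rng))
```

A classifier trained on text noised this way cannot rely on the script alone to predict the label, because the same language appears in all four scripts. A faithful reproduction would replace the offset mapping with a proper transliteration tool.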
Abstract
Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, and Punjabi, are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing), focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in…
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Swearing, Euphemism, Multilingualism
