Language Identification for Austronesian Languages
Jonathan Dunn, Wikke Nijhof

TL;DR
This paper develops and evaluates language identification models for Austronesian languages. A classifier based on skip-gram embeddings outperforms the other approaches tested, remains robust as the language inventory grows, and supports code-switching detection.
Contribution
Introduces new language identification models for under-resourced Austronesian languages, showing the effectiveness of skip-gram embeddings and robustness to large language inventories.
Findings
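Of the six approaches evaluated, the classifier based on skip-gram embeddings performs best.
Increasing the number of non-Austronesian languages in the model, up to 800 languages in total, has minimal impact on accuracy for the Austronesian languages.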
Models achieve high accuracy in code-switching detection.
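One common way to turn a language identification model into a code-switching detector is to label short windows of a text and look for points where the predicted language changes. The sketch below illustrates that idea only. This summary does not describe the paper's actual procedure, and `detect_switches`, `toy_identify`, and the tiny word list are hypothetical stand-ins for a trained model.

```python
def detect_switches(tokens, identify, window=3):
    """Label consecutive windows of tokens and return (start_index, language)
    for the first window and for every window where the label changes."""
    segments = []
    previous = None
    for start in range(0, len(tokens), window):
        chunk = " ".join(tokens[start:start + window])
        lang = identify(chunk)
        if lang != previous:
            segments.append((start, lang))
            previous = lang
    return segments


# Hypothetical stand-in for a trained language identification model:
# a window counts as Māori ("mri") if at least half of its words are
# in a tiny word list, otherwise as English ("eng").
MAORI_WORDS = {"kia", "ora", "koutou", "katoa"}


def toy_identify(text):
    words = text.lower().split()
    hits = sum(word in MAORI_WORDS for word in words)
    return "mri" if hits * 2 >= len(words) else "eng"


tokens = "kia ora koutou katoa welcome to the meeting today".split()
segments = detect_switches(tokens, toy_identify, window=2)
is_code_switched = len(segments) > 1
print(segments, is_code_switched)
```

A smaller window localizes the switch point more precisely but gives the classifier less text per prediction, so in practice the window size is a trade-off that would need tuning.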
Abstract
This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings reaches a significantly higher performance than alternate methods. We then systematically increase the number of non-Austronesian languages in the model up to a total of 800 languages to evaluate whether an increased language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that there is only a…
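The robustness question in the abstract, whether a larger language inventory makes predictions for the Austronesian languages less precise, comes down to tracking per-language precision and recall for a fixed set of target languages while other languages are added. The following is a minimal sketch of that measurement. The function name, the toy labels, and the choice of ISO 639-3 codes are assumptions for illustration, not the paper's data or code.

```python
from collections import Counter


def per_language_scores(gold, predicted, target_languages):
    """Return {language: (precision, recall, f1)} for each target language."""
    gold_count = Counter(gold)
    pred_count = Counter(predicted)
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)

    scores = {}
    for lang in target_languages:
        tp = true_pos[lang]
        precision = tp / pred_count[lang] if pred_count[lang] else 0.0
        recall = tp / gold_count[lang] if gold_count[lang] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[lang] = (precision, recall, f1)
    return scores


# Toy labels for illustration only (mri = Māori, haw = Hawaiian, eng = English).
gold = ["mri", "mri", "haw", "haw", "eng", "eng"]
predicted = ["mri", "haw", "haw", "haw", "eng", "mri"]

# Only the target (Austronesian) languages are scored; "eng" acts as a distractor.
scores = per_language_scores(gold, predicted, ["mri", "haw"])
for lang, (p, r, f) in scores.items():
    print(f"{lang}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Re-running the same scoring after retraining with more distractor languages would show whether precision for the target languages drops as the inventory grows.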
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection
