Automatic Language Identification for Celtic Texts

Olha Dovbnia; Anna Wróblewska

arXiv:2203.04831 · cs.CL · March 10, 2022 · 1 citation

Open Access

TL;DR

This paper develops a classification approach for Celtic languages using supervised and unsupervised features, demonstrating high accuracy and robustness with limited labeled data.

Contribution

It introduces a new dataset for Celtic languages and evaluates unsupervised feature extraction methods, improving low-resource language identification.

Findings

01. Unsupervised features enhance classification performance.

02. Dense neural networks outperform SVMs.

03. Unsupervised features are robust with less labeled data.

Abstract

Language identification is an important Natural Language Processing task that has been thoroughly researched in the literature, yet some issues remain open. This work addresses the identification of related low-resource languages, using the Celtic language family as an example. Its main goals were: (1) to collect a dataset of three Celtic languages; (2) to prepare a method for identifying languages from the Celtic family, i.e. to train a successful classification model; (3) to evaluate the influence of different feature extraction methods and explore the applicability of unsupervised models as a feature extraction technique; (4) to experiment with unsupervised feature extraction on a reduced annotated set. We collected a new dataset including Irish, Scottish Gaelic, Welsh and English records. We tested supervised models such as SVM and neural networks with…
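To make the classification setup in the abstract concrete, here is a minimal sketch of text-based language identification. It is not the authors' method: the abstract names SVMs and neural networks, while this sketch substitutes a much simpler nearest-profile classifier over character trigram counts. The trigram features, the function names, and the four toy sentences are all assumptions made for illustration and do not come from the paper or its dataset.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one aggregated n-gram profile per language label."""
    profiles = {}
    for label, text in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def predict(profiles, text):
    """Return the label whose profile is most similar to the text."""
    features = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(features, profiles[label]))


# Hypothetical toy sentences (assumption), one per language in the dataset
# described by the abstract; they are not taken from that dataset.
SAMPLES = [
    ("irish", "Tá an aimsir go maith inniu"),
    ("scottish_gaelic", "Tha an t-sìde math an-diugh"),
    ("welsh", "Mae'r tywydd yn braf heddiw"),
    ("english", "The weather is nice today"),
]

profiles = train(SAMPLES)
print(predict(profiles, "Tá an aimsir go maith"))
```

A real experiment would replace the toy sentences with a labeled corpus, hold out a test split, and swap the nearest-profile rule for a trained model such as the SVM or neural network the abstract mentions.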

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Translation Studies and Practices

Methods: Support Vector Machine