Automatic Language Identification for Celtic Texts
Olha Dovbnia, Anna Wróblewska

TL;DR
This paper develops a classification approach for Celtic languages using supervised and unsupervised features, demonstrating high accuracy and robustness with limited labeled data.
Contribution
It introduces a new dataset for Celtic languages and evaluates unsupervised feature extraction methods, improving low-resource language identification.
Findings
Unsupervised features enhance classification performance.
Dense neural networks outperform SVMs.
Unsupervised features are robust with less labeled data.
Abstract
Language identification is an important Natural Language Processing task that has been thoroughly researched in the literature, yet some issues remain open. This work addresses the identification of related low-resource languages, using the Celtic language family as an example. Its main goals were: (1) to collect a dataset of three Celtic languages; (2) to prepare a method for identifying languages from the Celtic family, i.e. to train a successful classification model; (3) to evaluate the influence of different feature extraction methods and explore the applicability of unsupervised models as a feature extraction technique; (4) to experiment with unsupervised feature extraction on a reduced annotated set. We collected a new dataset including Irish, Scottish Gaelic, Welsh and English records. We tested supervised models such as SVM and neural networks with…
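To make the classification setup in the abstract more concrete, here is a minimal sketch of language identification over the same four languages. It is not the authors' pipeline: it assumes character trigram counts as features (the truncated abstract does not say which features were used) and swaps in a simple nearest-profile classifier with cosine similarity in place of the SVM and neural network models. The function names, the language codes used as labels, and the four toy sentences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Sum n-gram counts per label to build one profile per language."""
    profiles = {}
    for label, text in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def predict(profiles, text):
    """Return the label whose profile is most similar to the text."""
    feats = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(feats, profiles[label]))


# Toy data (assumption): one sentence per language, not the paper's dataset.
TOY_SAMPLES = [
    ("ga", "Tá an aimsir go maith inniu"),   # Irish
    ("gd", "Tha an t-sìde math an-diugh"),   # Scottish Gaelic
    ("cy", "Mae'r tywydd yn braf heddiw"),   # Welsh
    ("en", "The weather is nice today"),     # English
]

profiles = train(TOY_SAMPLES)
print(predict(profiles, "the weather"))
```

With a realistic corpus, the per-language profiles would be replaced by a trained classifier, and the trigram counts could be extended with additional feature columns such as the unsupervised features the paper evaluates.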
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Translation Studies and Practices
Methods: Support Vector Machine
