Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus
Mullosharaf K. Arabov

TL;DR
This study benchmarks neural architectures for Tajik POS tagging on the TajPersParallel corpus and finds that multilingual models such as mBERT fine-tuned with LoRA perform best on this context-independent classification task.
Contribution
The paper provides the first systematic comparison of neural models for Tajik POS tagging, highlighting the effectiveness of multilingual transformers and the effect of missing context on morphological analysis.
Findings
mBERT + LoRA achieved the highest macro F1-score of 0.11 and weighted F1-score of 0.62.
Models struggled with rare function words because isolated dictionary entries provide no syntactic context.
Zero-shot evaluation indicated that Tajik is typologically closest to Persian and Russian.
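The gap between the macro and weighted F1-scores reported above is typical of heavily imbalanced tag sets: macro F1 averages per-class scores equally, while weighted F1 scales each class by its frequency. The sketch below illustrates the two averages on a toy example. The tag names, the label lists, and the `f1_scores` helper are hypothetical and are not taken from the paper or the TajPersParallel corpus.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (per_class_f1, macro_f1, weighted_f1) for two equal-length label lists."""
    labels = sorted(set(gold))
    support = Counter(gold)  # number of gold items per class
    per_class = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        per_class[label] = 2 * tp / denom if denom else 0.0
    # Macro: every class counts equally, however rare it is
    macro = sum(per_class.values()) / len(labels)
    # Weighted: every class counts in proportion to its gold frequency
    weighted = sum(per_class[l] * support[l] for l in labels) / len(gold)
    return per_class, macro, weighted


# Toy data (made up): one frequent class and two rare function-word classes.
gold = ["NOUN"] * 8 + ["ADP", "CONJ"]
# A tagger that always predicts the majority class.
pred = ["NOUN"] * 10

per_class, macro, weighted = f1_scores(gold, pred)
print(f"macro F1    = {macro:.3f}")
print(f"weighted F1 = {weighted:.3f}")
```

In this toy case the majority-class tagger scores well on the frequent class and zero on both rare classes, so the macro average falls far below the weighted one. The same mechanism would be consistent with a low macro F1 alongside a much higher weighted F1 when rare function words are mostly misclassified.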
Abstract
This paper presents the first benchmark for the task of automatic part-of-speech (POS) tagging for the Tajik language. Despite the existence of multilingual language models demonstrating high effectiveness for many of the world's languages, their capacity for grammatical analysis of Tajik has remained unexplored until now. The aim of this study is to fill this gap through a systematic comparison of classical neural network architectures and modern multilingual transformers. Experiments were conducted on the TajPersParallel corpus, a parallel lexical resource comprising approximately 44,000 dictionary entries. Due to the absence of full-fledged example sentences in the current version of the corpus, the task was performed at the level of isolated lexical units, representing a challenging case of context-independent classification. The study compares the following architectures: a…
