TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)
Mullosharaf K. Arabov

TL;DR
TajikNLP is an open-source Python toolkit for processing Tajik text in Cyrillic script. It covers tokenization, morphological analysis, sentiment analysis, and pre-trained embeddings, and ships with datasets and an extensively tested codebase.
Contribution
The paper presents TajikNLP as the first modular, open-source NLP pipeline for Tajik, together with a novel morphology engine and publicly available datasets for this low-resource language.
Findings
Achieves 93% test coverage of the source code.
Provides pre-trained embeddings through the Hugging Face Hub.
Includes four linguistic datasets for Tajik NLP.
Abstract
The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces TajikNLP, an open-source Python library that provides the first comprehensive pipeline for processing authentic Tajik text while preserving the original Cyrillic orthography. The library implements a modular architecture centered around a unified Doc object, enabling sequential application of components for cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, part-of-speech tagging, stemming, lemmatization, and sentence splitting. A novel unified morphology engine is introduced, offering controlled and deep analysis modes that significantly improve handling of Tajik's agglutinative nominal and verbal inflections. The…
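The idea of a single Doc object passed through a sequence of components can be illustrated with a minimal sketch. This is not the TajikNLP API: the names `Doc`, `clean`, `normalize`, `tokenize`, `split_sentences`, and `run_pipeline` are hypothetical, and the regex-based tokenizer and sentence splitter are simple stand-ins for the library's actual components.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical container that every pipeline stage reads and updates."""
    text: str
    tokens: List[str] = field(default_factory=list)
    sentences: List[str] = field(default_factory=list)


Component = Callable[[Doc], Doc]


def clean(doc: Doc) -> Doc:
    # Collapse runs of whitespace and trim both ends.
    doc.text = re.sub(r"\s+", " ", doc.text).strip()
    return doc


def normalize(doc: Doc) -> Doc:
    # NFC composes a base letter with its combining mark, so Tajik letters
    # such as ӣ and ӯ are stored as a single code point.
    doc.text = unicodedata.normalize("NFC", doc.text)
    return doc


def tokenize(doc: Doc) -> Doc:
    # Words are runs of Unicode word characters (Cyrillic included);
    # every other non-space character becomes its own token.
    doc.tokens = re.findall(r"\w+|[^\w\s]", doc.text)
    return doc


def split_sentences(doc: Doc) -> Doc:
    # Naive rule: break after '.', '!' or '?' when whitespace follows.
    doc.sentences = re.split(r"(?<=[.!?])\s+", doc.text)
    return doc


def run_pipeline(text: str, components: List[Component]) -> Doc:
    # Apply each component in order to the same Doc instance.
    doc = Doc(text=text)
    for component in components:
        doc = component(doc)
    return doc


PIPELINE: List[Component] = [clean, normalize, tokenize, split_sentences]

if __name__ == "__main__":
    result = run_pipeline("Ман   китоб мехонам.  Ӯ ба мактаб меравад!", PIPELINE)
    print(result.tokens)
    print(result.sentences)
```

Because every stage takes and returns the same object, components can be reordered, dropped, or replaced without changing the others. The Cyrillic text is never transliterated; normalization only unifies equivalent Unicode encodings of the same letter.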
