TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)

Mullosharaf K. Arabov

arXiv:2605.04583 · cs.CL · May 7, 2026

TL;DR

TajikNLP is an open-source NLP toolkit for Tajik text in Cyrillic script. It covers morphological analysis, tokenization, sentiment analysis, and pre-trained embeddings, and it ships with datasets and extensive tests.

Contribution

The paper introduces the first modular, open-source Tajik NLP pipeline, with a novel morphology engine and publicly available datasets, advancing low-resource language processing.

Findings

01. Achieves 93% test coverage of the source code.

02. Provides pre-trained embeddings via the Hugging Face Hub.

03. Includes four linguistic datasets for Tajik NLP.

Abstract

The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces TajikNLP, an open-source Python library that provides the first comprehensive pipeline for processing authentic Tajik text while preserving the original Cyrillic orthography. The library implements a modular architecture centered around a unified Doc object, enabling sequential application of components for cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, part-of-speech tagging, stemming, lemmatization, and sentence splitting. A novel unified morphology engine is introduced, offering controlled and deep analysis modes that significantly improve handling of Tajik's agglutinative nominal and verbal inflections. The…
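
To make the "unified Doc object" idea from the abstract concrete, here is a minimal sketch of a pipeline in which each component receives a document container, annotates it, and passes it on. This is not the TajikNLP API: the names `Doc`, `clean`, `split_sentences`, `tokenize`, and `run_pipeline` are hypothetical, and the regex-based rules are simple stand-ins for the library's actual components.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical container; an illustration, not TajikNLP's Doc class."""
    text: str
    sentences: List[str] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)


Component = Callable[[Doc], Doc]


def clean(doc: Doc) -> Doc:
    # NFC composes base letter + combining mark into one code point,
    # so Tajik letters such as Ӯ and ӣ are treated as single letters.
    text = unicodedata.normalize("NFC", doc.text)
    # Collapse runs of whitespace and trim both ends.
    doc.text = re.sub(r"\s+", " ", text).strip()
    return doc


def split_sentences(doc: Doc) -> Doc:
    # Naive rule (an assumption): split after ., ! or ? followed by a space.
    doc.sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.text) if s]
    return doc


def tokenize(doc: Doc) -> Doc:
    # \w is Unicode-aware in Python 3, so Cyrillic words stay intact;
    # each punctuation mark becomes its own token.
    doc.tokens = re.findall(r"\w+|[^\w\s]", doc.text)
    return doc


def run_pipeline(text: str, components: List[Component]) -> Doc:
    # Apply the components in order to one shared Doc.
    doc = Doc(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline(
    "  Ман  китоб мехонам.   Ӯ мактуб менависад!  ",
    [clean, split_sentences, tokenize],
)
print(doc.sentences)
print(doc.tokens)
```

Because every step shares one container, later stages such as stemming or part-of-speech tagging could be added as further functions in the list without changing the earlier ones.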
