TurkicNLP: An NLP Toolkit for Turkic Languages
Sherzod Hakimov

TL;DR
TurkicNLP is an open-source Python toolkit that provides a single NLP pipeline for Turkic languages across four script families (Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic), covering tasks from tokenization to machine translation through a modular, script-agnostic architecture.
Contribution
The paper introduces the first unified, multi-script NLP toolkit for Turkic languages, integrating rule-based finite-state transducers and neural models behind a standardized API and the CoNLL-U output format.
Findings
Supports four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic
Provides a full NLP pipeline including tokenization, parsing, NER, and translation
Enables cross-lingual and script-agnostic NLP for Turkic languages
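To make the script-agnostic idea concrete, here is a minimal sketch of how a script family could be detected from Unicode code points. It is not TurkicNLP's implementation or API: the names `SCRIPT_RANGES` and `detect_script` are hypothetical, and the majority-vote heuristic is an assumption chosen for illustration.

```python
from collections import Counter

# Inclusive Unicode code point ranges for the four script families.
# The Latin range extends through IPA Extensions so that U+0259 (schwa,
# used in Azerbaijani) is counted as Latin.
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x02AF)],
    "Cyrillic": [(0x0400, 0x052F)],
    "Perso-Arabic": [
        (0x0600, 0x06FF),
        (0x0750, 0x077F),
        (0xFB50, 0xFDFF),
        (0xFE70, 0xFEFF),
    ],
    "Old Turkic Runic": [(0x10C00, 0x10C4F)],
}


def detect_script(text):
    """Return the script family covering most characters, or None."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for script, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[script] += 1
                break  # each character is counted for one script only
    if not counts:
        return None  # digits, punctuation, or unsupported scripts only
    return counts.most_common(1)[0][0]
```

A router built on such a function could pick a script-specific tokenizer or transliterator before the rest of the pipeline runs. Mixed-script input would need a finer-grained, per-token decision than this whole-string vote.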
Abstract
Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and…
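As an illustration of the CoNLL-U output format mentioned in the abstract, the following sketch serializes one annotated sentence into the ten tab-separated CoNLL-U columns. The helper `to_conllu` and the dictionary layout are hypothetical and are not taken from TurkicNLP. The annotations for the Turkish example are hand-written for illustration.

```python
# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def to_conllu(tokens, text=None):
    """Serialize a list of token dicts as one CoNLL-U sentence block."""
    lines = []
    if text is not None:
        lines.append(f"# text = {text}")  # optional sentence-level comment
    for tok in tokens:
        # Unspecified fields are written as "_", as CoNLL-U requires.
        lines.append("\t".join(str(tok.get(f, "_")) for f in CONLLU_FIELDS))
    # A blank line terminates each sentence.
    return "\n".join(lines) + "\n\n"


sentence = [
    {"id": 1, "form": "Ben", "lemma": "ben", "upos": "PRON",
     "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "geldim", "lemma": "gel", "upos": "VERB",
     "head": 0, "deprel": "root"},
    {"id": 3, "form": ".", "lemma": ".", "upos": "PUNCT",
     "head": 2, "deprel": "punct"},
]
output = to_conllu(sentence, text="Ben geldim.")
print(output, end="")
```

Because every row has the same ten columns regardless of language or script, downstream tools that already read CoNLL-U can consume such output without language-specific handling.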
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Linguistics and Cultural Studies
