# UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging

**Authors:** Maksud Sharipov, Elmurod Kuriyozov, Jernej Vičič

PMC · DOI: 10.1016/j.dib.2026.112640 · Data in Brief · 2026-02-28

## TL;DR

UzbekPOS is a manually annotated part-of-speech (POS) tagged dataset for Uzbek, built to support natural language processing and linguistic research.

## Contribution

UzbekPOS is currently the largest publicly available POS-tagged corpus for Uzbek. Its fine-grained, 16-tag tagset combines standard Universal Dependencies tags with labels specific to Uzbek morphology and syntax.

## Key findings

- The dataset contains almost 4.5K sentences and more than 53K token/tag pairs, annotated by professional annotators and cross-verified by at least two annotators.
- Sentences are drawn from literature, news outlets, science, education, and public speaking to reflect linguistic and topical diversity in Uzbek.
- The dataset is distributed in raw text (txt), TSV, JSON, and CoNLL-U formats, and can be used for training models and for linguistic studies.
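
As a rough illustration of how the CoNLL-U distribution might be consumed, the sketch below reads CoNLL-U text into sentences of token/tag pairs. It assumes the POS tag sits in the standard UPOS column (the fourth column); the summary does not say whether UzbekPOS stores its Uzbek-specific labels there or in XPOS. The sample sentence and the `read_conllu` helper are hypothetical and are not taken from the dataset.

```python
from typing import Iterator

# Illustrative sample only; NOT drawn from UzbekPOS.
# CoNLL-U has 10 tab-separated columns; unused ones are "_".
SAMPLE = """\
# text = Men kitob o'qidim.
1\tMen\t_\tPRON\t_\t_\t_\t_\t_\t_
2\tkitob\t_\tNOUN\t_\t_\t_\t_\t_\t_
3\to'qidim\t_\tVERB\t_\t_\t_\t_\t_\t_
4\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_

"""


def read_conllu(text: str) -> Iterator[list[tuple[str, str]]]:
    """Yield each sentence as a list of (token, tag) pairs."""
    sentence: list[tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ...".
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        # Assumption: the tag of interest is in the UPOS column (index 3).
        sentence.append((cols[1], cols[3]))
    if sentence:
        yield sentence


sentences = list(read_conllu(SAMPLE))
```

To read a real file, the same function could be applied to the file's contents, for example `read_conllu(open(path, encoding="utf-8").read())`, where `path` is whatever location the CoNLL-U file was downloaded to.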

## Abstract

In this paper, we introduce UzbekPOS, a manually annotated part-of-speech (POS) tagged dataset for the Uzbek language, designed for natural language processing, artificial intelligence models, and corpus linguistics applications. It is currently the largest publicly available POS-tagged corpus for Uzbek. The dataset comprises sentences drawn from a diverse range of Uzbek text sources, including literature, news outlets, science, education, and public speaking, to reflect linguistic and topical diversity. The sentences are tokenized and annotated by professional annotators using a fine-grained POS tagset of 16 tags in total, which integrates standard Universal Dependencies tags with additional labels specific to the morphological and syntactic features of Uzbek.

UzbekPOS contains almost 4.5K sentences and more than 53K token/tag pairs, with each annotation cross-verified by at least two annotators for reliability. It is distributed as raw text (txt), in the commonly used TSV and JSON formats, and in the CoNLL-U format widely used for POS-tagged data. This resource is one of the first, and the largest, openly published POS-tagged datasets for Uzbek, an under-resourced and morphologically complex Turkic language. The dataset can serve as a foundation for training POS taggers, as a test set for machine learning models, and as a source for linguistic studies. It can also be reused for related tasks such as morphological analysis, syntactic parsing, and transfer learning across languages of the Turkic family. Furthermore, it can serve as seed material for creating similar POS-tagged corpora for other Turkic languages and can support cross-linguistic analyses and tool building.
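
To make the "test set" use case concrete, here is a minimal sketch of token-level tagging accuracy, a common way to score a POS tagger against gold annotations. The paper summary does not specify an evaluation protocol, so the metric choice, the `token_accuracy` helper, and the toy tag sequences below are assumptions for illustration only.

```python
def token_accuracy(gold: list[list[str]], predicted: list[list[str]]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of sentences")
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence must have the same number of tags")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    # Define accuracy on an empty test set as 0.0 to avoid division by zero.
    return correct / total if total else 0.0


# Toy example (hypothetical tags, not results from the paper):
gold = [["PRON", "NOUN", "VERB", "PUNCT"]]
predicted = [["PRON", "NOUN", "NOUN", "PUNCT"]]
accuracy = token_accuracy(gold, predicted)  # 3 of 4 tags match -> 0.75
```

Token-level accuracy treats every token equally; per-tag precision and recall would be a natural complement when some of the 16 tags are rare.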

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12993376/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12993376/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/PMC12993376/full.md

---
Source: https://tomesphere.com/paper/PMC12993376