Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek

Mukhammadsaid Mamasaidov; Azizullah Aral; Abror Shopulatov; Mironshoh Inomjonov

arXiv:2508.14586·cs.CL·August 21, 2025

TL;DR

This paper introduces new translation resources and a model for Southern Uzbek, a low-resource Turkic language, including datasets, a fine-tuned model, and a post-processing method to enhance translation quality.

Contribution

It provides the first comprehensive resources and a specialized model for Southern Uzbek, addressing its underrepresentation in NLP.

Findings

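01 A 997-sentence FLORES+ dev set and 39,994 parallel sentences from dictionary, literary, and web sources are released.

02 A fine-tuned NLLB-200 model (lutfiy) for Southern Uzbek is released.

03 A post-processing step that restores Arabic-script half-space characters improves the handling of morphological boundaries.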

Abstract

Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.
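The abstract does not spell out how half-space restoration works, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that the half-space is the zero-width non-joiner (U+200C), that the translation output has replaced it with an ordinary space before a suffix, and that a small list of suffixes is known in advance. The function name `restore_half_spaces` and the suffix list are hypothetical and purely illustrative.

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, commonly called the half-space

# Hypothetical, illustrative suffix list (NOT taken from the paper):
# plural, accusative, and dative markers written in Arabic script.
SUFFIXES = ("لر", "نی", "گه")


def restore_half_spaces(text, suffixes=SUFFIXES):
    """Replace the plain space before a standalone known suffix with ZWNJ.

    Assumption: the upstream model emitted a regular space where the
    orthography expects a half-space between a stem and its suffix.
    """
    # Try longer suffixes first so a short one never shadows a long one.
    ordered = sorted(suffixes, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # A space preceded by a non-space character, followed by a suffix that
    # ends at whitespace or at the end of the string.
    pattern = re.compile(r"(?<=\S) (" + alternation + r")(?=\s|$)")
    return pattern.sub(lambda m: ZWNJ + m.group(1), text)


if __name__ == "__main__":
    sample = "کتاب لر"  # stem and plural suffix separated by a plain space
    fixed = restore_half_spaces(sample)
    print(ZWNJ in fixed)  # True when a half-space was restored
```

A rule list like this is easy to inspect but will misfire when a suffix string also occurs as an independent word, so any real use would need the released tools or a learned model rather than this sketch.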

Code & Models

Models

tahrirchi/lutfiy (model): 3 downloads, 3 likes

Datasets

tahrirchi/lutfiy (dataset): 15 downloads

Videos

Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek (Underline)

Taxonomy

Topics: Central Asia Education and Culture · Education, Innovation and Language Studies · Economic and Industrial Development