No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Dmitry Karpov

arXiv:2602.04442 · cs.CL · February 5, 2026

TL;DR

This paper investigates machine translation into five Turkic languages (Bashkir, Kazakh, Kyrgyz, Tatar, and Chuvash) using synthetic and original data. It compares LoRA fine-tuning on synthetic data with retrieval-based and zero-shot prompting, and releases the dataset and the resulting model weights.

Contribution

It introduces translation models and a dataset for five Turkic languages, and shows that the best-performing approach (fine-tuning on synthetic data, retrieval-based prompting, or zero-shot prompting) differs by language.

Findings

01

Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data reaches chrF++ 49.71 for Kazakh and 46.94 for Bashkir.

02

Prompting DeepSeek-V3.2 with retrieved similar examples reaches chrF++ 39.47 for Chuvash.

03

For Tatar, zero-shot or retrieval-based approaches reach chrF++ 41.6; for Kyrgyz, the zero-shot approach reaches 45.6.

Abstract

We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.
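
The retrieval-based prompting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how similar examples are retrieved or how the prompt is worded, so everything here is an assumption made for illustration: character-trigram cosine similarity as the retriever, the prompt template, and the hypothetical helper names `char_ngrams`, `cosine_similarity`, `retrieve_examples`, and `build_prompt`. The target-side strings are placeholders, not real Chuvash translations.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query, pairs, k=2):
    """Return the k (source, target) pairs whose source is most similar to query.

    Assumption: similarity is trigram cosine; the paper may use another retriever.
    """
    query_vec = char_ngrams(query)
    ranked = sorted(
        pairs,
        key=lambda pair: cosine_similarity(query_vec, char_ngrams(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, pairs, k=2, src="English", tgt="Chuvash"):
    """Assemble a few-shot prompt from retrieved pairs (hypothetical template)."""
    blocks = [f"Translate from {src} to {tgt}."]
    for source, target in retrieve_examples(query, pairs, k):
        blocks.append(f"{src}: {source}\n{tgt}: {target}")
    blocks.append(f"{src}: {query}\n{tgt}:")
    return "\n\n".join(blocks)


# Placeholder parallel data; the target side stands in for real translations.
examples = [
    ("The weather is nice today", "<target sentence A>"),
    ("I like reading books", "<target sentence B>"),
    ("The weather was bad yesterday", "<target sentence C>"),
]
prompt = build_prompt("The weather is cold today", examples, k=2)
print(prompt)
```

The resulting string would be sent to the language model as the translation request. A dense sentence encoder or BM25 could replace the trigram retriever without changing the surrounding structure.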

Code & Models

Datasets

dimakarp1996/YaTURK-7lang
dataset · 68 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling