Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak
Mukhammadsaid Mamasaidov, Abror Shopulatov

TL;DR
This paper introduces new datasets and neural translation models for the low-resource Karakalpak language, demonstrating improved translation performance and supporting linguistic diversity in NLP.
Contribution
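It provides parallel corpora of 100,000 pairs each for Uzbek-Karakalpak, Russian-Karakalpak, and English-Karakalpak, a FLORES+ devtest set translated to Karakalpak, and open-source fine-tuned neural translation models, advancing machine translation for a low-resource language.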
Findings
Fine-tuned neural models improve over existing baselines
The new parallel corpora enable better translation quality for Karakalpak
Open-sourced models facilitate further research
Abstract
This study presents several contributions for the Karakalpak language: a FLORES+ devtest dataset translated to Karakalpak; parallel corpora for Uzbek-Karakalpak, Russian-Karakalpak, and English-Karakalpak of 100,000 pairs each; and open-sourced fine-tuned neural models for translation across these languages. Our experiments compare different model variants and training approaches, demonstrating improvements over existing baselines. This work, conducted as part of the Open Language Data Initiative (OLDI) shared task, aims to advance machine translation capabilities for Karakalpak and contribute to expanding linguistic diversity in NLP technologies.
Taxonomy
Topics: Natural Language Processing Techniques · Linguistics and Cultural Studies
