A Large-Scale Study of Machine Translation in the Turkic Languages
Jamshidbek Mirzakhalov, Anoop Babu, Duygu Ataman, Sherzod Kariev, Francis Tyers, Otabek Abduraufov, Mammad Hajili, Sardana Ivanova, Abror Khaytbaev, Antonio Laverghetta Jr., Behzodbek Moydinboyev, Esra Onal, Shaxnoza Pulatova, Ahsan Wahab, Orhan Firat, Sriram Chellappan

TL;DR
This paper conducts a comprehensive large-scale analysis of neural machine translation for Turkic languages, addressing data scarcity and providing resources, baselines, and evaluations to advance translation quality in these languages.
Contribution
It introduces a large parallel corpus, bilingual baselines, new test sets, and human evaluation for Turkic languages, facilitating future NMT research and development.
Findings
Identified bottlenecks in Turkic language NMT systems.
Provided extensive datasets and benchmarks for 26 language pairs.
Released resources publicly to support further research.
Abstract
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including, i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 2 million parallel sentences, ii)…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
