TL;DR
This paper presents KokborokMT, a neural machine translation system for Kokborok, a low-resource Tibeto-Burman language. Fine-tuning NLLB-200-distilled-600M on a multi-source parallel corpus yields a test-set BLEU of 17.30, compared with scores below 7 for earlier systems trained on Bible-derived corpora.
Contribution
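The authors introduce KokborokMT, a translation system fine-tuned from NLLB-200-distilled-600M on 36,052 sentence pairs drawn from three sources (SMOL, WMT shared task Bible data, and synthetic back-translations), with a new language token added for Kokborok. They report higher BLEU scores than prior systems, along with human adequacy and fluency ratings.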
Findings
BLEU scores of 17.30 (test set) and 38.56 (validation set) achieved
Human evaluations show mean adequacy of 3.74/5 and fluency of 3.70/5
Improvement over prior Bible-based translation systems, which scored below 7 BLEU
Abstract
We present KokborokMT, a high-quality neural machine translation (NMT) system for Kokborok (ISO 639-3), a Tibeto-Burman language spoken primarily in Tripura, India, with approximately 1.5 million speakers. Despite its status as an official language of Tripura, Kokborok has remained severely under-resourced in the NLP community, with prior machine translation attempts limited to systems trained on small Bible-derived corpora that achieve BLEU scores below 7. We fine-tune the NLLB-200-distilled-600M model on a multi-source parallel corpus comprising 36,052 sentence pairs: 9,284 professionally translated sentences from the SMOL dataset, 1,769 Bible-domain sentences from WMT shared task data, and 24,999 synthetic back-translated pairs generated via Gemini Flash from Tatoeba English source sentences. We introduce a new language token for Kokborok in the NLLB framework. Our best system…
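The corpus composition stated in the abstract can be checked with a short calculation. The sketch below is illustrative only: the counts are taken from the abstract, while the dictionary keys and variable names are hypothetical labels, not identifiers from the authors' code.

```python
# Sentence-pair counts as stated in the abstract.
# The keys are descriptive labels chosen for this sketch (an assumption),
# not names used by the authors.
corpus = {
    "SMOL (professionally translated)": 9_284,
    "WMT shared task (Bible domain)": 1_769,
    "Synthetic back-translation (Gemini Flash, Tatoeba English sources)": 24_999,
}

# The three sources should add up to the reported corpus size.
total = sum(corpus.values())
assert total == 36_052

# Fraction of the training corpus contributed by each source.
shares = {name: count / total for name, count in corpus.items()}

for name, count in corpus.items():
    print(f"{name}: {count:,} pairs ({shares[name]:.1%})")
print(f"Total: {total:,} pairs")
```

By these counts, roughly 69% of the training pairs are synthetic, about 26% are professionally translated, and about 5% come from the Bible domain.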