Towards High-Quality Machine Translation for Kokborok: A Low-Resource Tibeto-Burman Language of Northeast India

Badal Nyalang; Biman Debbarma

arXiv:2604.19778·cs.CL·April 23, 2026

TL;DR

This paper presents a neural machine translation system for Kokborok, a low-resource Tibeto-Burman language, built by fine-tuning NLLB-200-distilled-600M on a multi-source parallel corpus. It substantially outperforms prior systems trained on small Bible-derived corpora, which scored below 7 BLEU.

Contribution

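The authors introduce KokborokMT, an NLLB-200-distilled-600M model fine-tuned on 36,052 sentence pairs drawn from three sources (professionally translated SMOL data, Bible-domain WMT data, and synthetic back-translations), with a new language token added for Kokborok. The system is assessed with both BLEU and human ratings of adequacy and fluency.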

Findings

01  BLEU scores of 17.30 (test set) and 38.56 (validation set)

02  Human evaluation: mean adequacy of 3.74/5 and mean fluency of 3.70/5

03  Substantial improvements over prior Bible-based translation systems

Abstract

We present KokborokMT, a high-quality neural machine translation (NMT) system for Kokborok (ISO 639-3), a Tibeto-Burman language spoken primarily in Tripura, India, by approximately 1.5 million speakers. Although it is an official language of Tripura, Kokborok has remained severely under-resourced in the NLP community: prior machine translation attempts were limited to systems trained on small Bible-derived corpora, which achieved BLEU scores below 7. We fine-tune the NLLB-200-distilled-600M model on a multi-source parallel corpus of 36,052 sentence pairs: 9,284 professionally translated sentences from the SMOL dataset, 1,769 Bible-domain sentences from WMT shared task data, and 24,999 synthetic back-translated pairs generated via Gemini Flash from Tatoeba English source sentences. We introduce a new language token for Kokborok in the NLLB framework. Our best system…
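
The corpus composition stated in the abstract can be checked with a short sketch. The counts below are taken from the abstract; the dictionary keys and the helper name `source_shares` are illustrative labels chosen here, not identifiers from the paper or its code.

```python
# Sentence-pair counts per source, as reported in the abstract.
# The key names are illustrative labels, not identifiers from the paper.
CORPUS_SOURCES = {
    "smol_professional": 9_284,       # professionally translated (SMOL)
    "wmt_bible": 1_769,               # Bible-domain (WMT shared task data)
    "gemini_backtranslated": 24_999,  # synthetic, from Tatoeba English sources
}
REPORTED_TOTAL = 36_052  # total stated in the abstract


def source_shares(sources):
    """Return each source's fraction of the combined corpus."""
    total = sum(sources.values())
    return {name: count / total for name, count in sources.items()}


total = sum(CORPUS_SOURCES.values())
shares = source_shares(CORPUS_SOURCES)

for name, share in shares.items():
    print(f"{name}: {CORPUS_SOURCES[name]:>6} pairs ({share:.1%})")
print("total pairs:", total, "(matches reported total)" if total == REPORTED_TOTAL else "(mismatch)")
```

The three counts sum to the reported 36,052, and the synthetic back-translated pairs make up roughly 69% of the training data, which is worth keeping in mind when reading the BLEU and human-evaluation figures.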

Code & Models

Models

MWirelabs/kokborok-mt (model, 4 downloads)
