Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages
Advait Joglekar, Srinivasan Umesh

TL;DR
This paper introduces Shiksha, a large multilingual translation dataset and model focused on Indian languages, improving scientific and technical translation performance for low-resource languages.
Contribution
The creation of a parallel corpus of more than 2.8 million translation pairs across 8 Indian languages, and the development of models that outperform existing ones on in-domain and out-of-domain translation tasks.
Findings
Surpassed all publicly available models on in-domain tasks.
Improved BLEU scores by more than 2 points on out-of-domain benchmarks.
Released a high-quality dataset and translation model for Indian languages.
Abstract
Neural Machine Translation (NMT) models are typically trained on datasets with limited exposure to Scientific, Technical and Educational domains. As a result, translation models generally struggle with tasks that involve scientific understanding or technical jargon. Their performance is found to be even worse for low-resource Indian languages. Finding a translation dataset that caters to these domains in particular is difficult. In this paper, we address this by creating a multilingual parallel corpus containing more than 2.8 million rows of high-quality English-to-Indic and Indic-to-Indic translation pairs across 8 Indian languages. We achieve this by bitext mining human-translated transcriptions of NPTEL video lectures. We also finetune and evaluate NMT models using this corpus and surpass all other publicly available models at in-domain tasks. We also demonstrate the…
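The abstract names bitext mining but this summary does not describe the procedure, so the following is only a generic sketch of one common approach: embed the sentences of both transcripts, then keep pairs that are mutual nearest neighbours under cosine similarity and clear a score threshold. The function names, the 0.8 threshold, and the toy three-dimensional vectors are all hypothetical; a real pipeline would obtain the vectors from a multilingual sentence encoder, and the authors' actual method may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual nearest neighbours.

    A pair is kept only if each sentence is the other's best match
    and their similarity is at least `threshold` (an assumed value).
    """
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    # Best source for every target sentence.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs


# Toy stand-ins for sentence embeddings (hypothetical values).
english = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
hindi = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.1], [-1.0, 0.2, 0.0]]

pairs = mine_pairs(english, hindi)
print([(i, j) for i, j, _ in pairs])
```

In this toy run the third English vector has no reciprocal match, so it is dropped rather than forced into a pair. Requiring agreement in both directions trades recall for precision, which is one plausible way to keep a mined corpus clean.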
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices
