Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

Advait Joglekar; Srinivasan Umesh

arXiv:2412.09025·cs.CL·December 13, 2024

TL;DR

This paper introduces Shiksha, a large multilingual translation dataset and model focused on Indian languages, improving scientific and technical translation performance for low-resource languages.

Contribution

A multilingual parallel corpus of more than 2.8 million translation pairs across 8 Indian languages, together with finetuned NMT models that surpass other publicly available models on in-domain tasks and improve BLEU on out-of-domain benchmarks.

Findings

1. Surpassed all other publicly available models on in-domain tasks.

2. Improved BLEU scores by more than 2 points on out-of-domain benchmarks.

3. Released a high-quality dataset and translation model for Indian languages.
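To make the BLEU figure above concrete, here is a minimal sketch of a simplified sentence-level BLEU: the geometric mean of 1- to 4-gram precisions, multiplied by a brevity penalty and scaled to 0-100. It is an illustration only. The function names are hypothetical, whitespace tokenization and the absence of smoothing are simplifying assumptions, and it is not the evaluation setup used in the paper, which this page does not describe.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified, unsmoothed sentence-level BLEU on a 0-100 scale."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped overlap: each hypothesis n-gram is credited at most
        # as many times as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)
```

Published results are normally corpus-level scores from a standard tool, so numbers from this sketch are not comparable to the reported gain; it only shows what a difference of a couple of BLEU points is measuring.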

Abstract

Neural Machine Translation (NMT) models are typically trained on datasets with limited exposure to Scientific, Technical and Educational domains. As a result, translation models generally struggle with tasks that involve scientific understanding or technical jargon, and their performance is even worse for low-resource Indian languages. Finding a translation dataset that caters to these domains in particular is difficult. In this paper, we address this by creating a multilingual parallel corpus containing more than 2.8 million rows of high-quality English-to-Indic and Indic-to-Indic translation pairs across 8 Indian languages. We achieve this by bitext mining human-translated transcriptions of NPTEL video lectures. We also finetune and evaluate NMT models using this corpus and surpass all other publicly available models at in-domain tasks. We also demonstrate the…
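Bitext mining, as mentioned in the abstract, generally means pairing sentences from two languages by comparing their embeddings. The sketch below shows one common variant, mutual nearest neighbours under cosine similarity with a score threshold. It assumes sentence embeddings have already been computed by some multilingual encoder; the function names, the toy vectors, and the 0.8 threshold are hypothetical, and this is not claimed to be the authors' pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mine_bitext(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual nearest neighbours.

    A pair is kept only if each sentence is the other's best match and
    their similarity reaches the (assumed) threshold.
    """
    if not src_vecs or not tgt_vecs:
        return []
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    # Best source for every target sentence.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs


# Toy embeddings standing in for encoder output (hypothetical values).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]
aligned = mine_bitext(src, tgt)
```

With these toy vectors, the first two source sentences are each aligned to a target, while the third is left unpaired because its best match falls below the threshold. Requiring agreement in both directions trades recall for precision, which suits a corpus described as high-quality.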

Code & Models

Models

SPRINGLab/shiksha-MT-nllb-3.3B

Datasets

SPRINGLab/shiksha

Taxonomy

Topics: Natural Language Processing Techniques · Translation Studies and Practices