SPRING Lab IITM's submission to Low Resource Indic Language Translation Shared Task

Hamees Sayed; Advait Joglekar; Srinivasan Umesh

arXiv:2411.00727 · cs.CL · November 12, 2024

TL;DR

This paper presents a comprehensive translation pipeline for four low-resource Indic languages, utilizing data augmentation, fine-tuning of pre-trained models, and special token strategies to improve translation quality.

Contribution

The authors develop a robust translation approach for Khasi, Mizo, Manipuri, and Assamese, including data augmentation and model fine-tuning techniques tailored for low-resource languages.

Findings

01. Significant performance improvements over baselines.

02. Effective use of back-translation for data augmentation.

03. Successful adaptation of NLLB model for unsupported language Khasi.
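
The third finding concerns a language the pre-trained model has no tag for. As a rough illustration of what introducing a special token involves, the toy sketch below registers a new language tag in a vocabulary and grows the embedding table by one row. Everything in it is an assumption made for illustration: the function names, the tiny vocabulary, the tag `kha_Latn` (written in the style of NLLB language codes), and the mean-of-rows initialisation are not taken from the paper.

```python
# Toy sketch only: a plain-Python stand-in for extending a model's vocabulary.
# All names and the initialisation heuristic are hypothetical.

def add_language_token(vocab, embeddings, token):
    """Add `token` to `vocab` and append one embedding row for it.

    The new row is the mean of the existing rows. This is one common
    heuristic, assumed here; the paper does not state its initialisation.
    """
    if token in vocab:
        return vocab[token]
    new_id = len(vocab)
    vocab[token] = new_id
    dim = len(embeddings[0])
    mean_row = [sum(row[i] for row in embeddings) / len(embeddings) for i in range(dim)]
    embeddings.append(mean_row)
    return new_id


def tag_source(tokens, lang_token):
    """Prepend the language tag so the model knows which language it is reading."""
    return [lang_token] + tokens


# A three-entry vocabulary with two-dimensional embeddings, for illustration.
vocab = {"<pad>": 0, "eng_Latn": 1, "asm_Beng": 2}
embeddings = [[0.0, 0.0], [1.0, 3.0], [2.0, 3.0]]

khasi_id = add_language_token(vocab, embeddings, "kha_Latn")
tagged = tag_source(["w1", "w2", "w3"], "kha_Latn")
```

In a real setup the new row would then be trained on the Khasi corpus; here it only shows where the extra token and its embedding would sit.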

Abstract

We develop a robust translation model for four low-resource Indic languages: Khasi, Mizo, Manipuri, and Assamese. Our approach includes a comprehensive pipeline from data collection and preprocessing to training and evaluation, leveraging data from WMT task datasets, BPCC, PMIndia, and OpenLanguageData. To address the scarcity of bilingual data, we use back-translation techniques on monolingual datasets for Mizo and Khasi, significantly expanding our training corpus. We fine-tune the pre-trained NLLB 3.3B model for Assamese, Mizo, and Manipuri, achieving improved performance over the baseline. For Khasi, which is not supported by the NLLB model, we introduce special tokens and train the model on our Khasi corpus. Our training involves masked language modelling, followed by fine-tuning for English-to-Indic and Indic-to-English translations.
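
The back-translation step mentioned in the abstract can be sketched roughly as follows. Monolingual target-language sentences are passed through a reverse (target-to-source) model, and each output is paired with its original sentence to form synthetic training data. The names `back_translate` and `toy_reverse_model` are hypothetical, and the reverse model is a trivial stand-in rather than a trained translator.

```python
from typing import Callable, List, Tuple


def back_translate(
    monolingual_target: List[str],
    target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-language text.

    The target side is genuine text; only the source side is machine-generated.
    """
    pairs = []
    for target_sentence in monolingual_target:
        synthetic_source = target_to_source(target_sentence)
        if synthetic_source.strip():  # drop sentences the reverse model returned empty
            pairs.append((synthetic_source, target_sentence))
    return pairs


def toy_reverse_model(sentence: str) -> str:
    # Stand-in for a trained target-to-English model (assumption): it only uppercases.
    return sentence.upper()


# Placeholder strings stand in for real bilingual and monolingual data.
real_pairs = [("english sentence a", "target sentence a")]
synthetic = back_translate(["target sentence b", "target sentence c"], toy_reverse_model)
training_corpus = real_pairs + synthetic
```

Keeping the human-written text on the target side is the usual motivation for this design: the model learns to produce fluent output even when the synthetic source is noisy.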

Taxonomy

Topics: Natural Language Processing Techniques