Data Augmentation With Back translation for Low Resource languages: A case of English and Luganda

Richard Kimera; Dongnyeong Heo; Daniela N. Rim; Heeyoul Choi

arXiv:2505.02463 · cs.CL · May 6, 2025

TL;DR

This study shows that back translation, which generates synthetic parallel data from monolingual text, significantly improves neural machine translation quality for the low-resource English-Luganda language pair, surpassing previous benchmarks by more than 10 BLEU points.

Contribution

The paper introduces a novel incremental back translation approach with curated datasets, achieving substantial translation quality improvements for low-resource languages.

Findings

01. Translation performance exceeded previous benchmarks by over 10 BLEU points.

02. Back translation effectively mitigates data scarcity in low-resource NMT.

03. Comprehensive evaluation confirms the efficacy of the proposed method.

Abstract

In this paper, we explore the application of back translation (BT) as a semi-supervised technique to enhance Neural Machine Translation (NMT) models for the English-Luganda language pair, specifically addressing the challenges faced by low-resource languages. The purpose of our study is to demonstrate how BT can mitigate the scarcity of bilingual data by generating synthetic data from monolingual corpora. Our methodology involves developing custom NMT models using both publicly available and web-crawled data, and applying iterative and incremental back translation techniques. A novel element of our approach is that we strategically select datasets for incremental back translation across multiple small datasets. The results of our study show significant improvements, with translation performance for the English-Luganda pair exceeding previous benchmarks by more than 10 BLEU score units…
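The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the word-lookup "model", the function names (`train_toy_backward`, `back_translate`, `incremental_back_translation`), and the toy sentences are all hypothetical stand-ins, and the loop is only one plausible reading of incremental back translation (adding one monolingual dataset at a time and retraining the backward model on the grown corpus). Lowercase tokens stand in for the source language and uppercase tokens for the target language.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]             # (source sentence, target sentence)
Translator = Callable[[str], str]  # maps a target sentence to a source sentence


def train_toy_backward(corpus: List[Pair]) -> Translator:
    """Hypothetical stand-in for training a target-to-source NMT model.

    It only memorises position-aligned word pairs; a real system would
    train a neural model on the same corpus.
    """
    table: Dict[str, str] = {}
    for source, target in corpus:
        src_tokens, tgt_tokens = source.split(), target.split()
        if len(src_tokens) == len(tgt_tokens):
            table.update(zip(tgt_tokens, src_tokens))

    def translate(sentence: str) -> str:
        # Unknown words are copied through unchanged.
        return " ".join(table.get(tok, tok) for tok in sentence.split())

    return translate


def back_translate(monolingual_target: List[str], backward: Translator) -> List[Pair]:
    """Pair each real target sentence with a synthetic source sentence."""
    return [(backward(t), t) for t in monolingual_target]


def incremental_back_translation(
    parallel: List[Pair],
    monolingual_sets: List[List[str]],
    train_backward: Callable[[List[Pair]], Translator],
) -> List[Pair]:
    """Add one monolingual dataset at a time, retraining before each step."""
    corpus = list(parallel)
    for mono in monolingual_sets:
        backward = train_backward(corpus)  # retrain on real + synthetic pairs
        corpus.extend(back_translate(mono, backward))
    return corpus


# Toy data: one real parallel pair and two small monolingual target sets.
parallel = [("red car", "RED CAR")]
monolingual_sets = [["RED"], ["CAR RED"]]

augmented = incremental_back_translation(parallel, monolingual_sets, train_toy_backward)
print(augmented)
```

The returned corpus keeps the real pair and appends one synthetic pair per monolingual sentence. In practice the forward (source-to-target) model would then be trained on this augmented corpus, and the synthetic source sides would be noisy model outputs rather than exact lookups.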
