Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation

Saumitra Yadav; Manish Shrivastava

arXiv:2601.08629·cs.CL·March 12, 2026

TL;DR

This paper introduces LALITA, a novel framework for selecting source sentences based on lexical and linguistic features to efficiently create parallel corpora, significantly reducing data needs and improving low-resource machine translation quality.

Contribution

LALITA is a new sentence selection method that enhances low-resource MT by curating effective parallel data using linguistic insights, reducing data requirements by over 50%.

Findings

01

Training on complex sentences improves translation quality.

02

LALITA reduces data needs across multiple languages.

03

Method enhances MT performance with less data.

Abstract

Data curation is a critical yet under-researched step in the machine translation training paradigm. To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation. For low-resource languages, however, producing sufficient data through human translation is prohibitively expensive. It is therefore crucial to develop a framework that screens source sentences to form efficient parallel text, ensuring optimal MT system performance in low-resource environments. We approach this by evaluating English-Hindi bi-text to determine effective sentence selection strategies for MT system training. Our extensively tested framework, LALITA (Lexical And Linguistically Informed Text Analysis), targets source sentence selection using lexical and linguistic features to curate parallel corpora. We find that by…
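
As a rough illustration of what source-side selection under a translation budget could look like, the sketch below ranks candidate source sentences by a crude complexity proxy and keeps the top few for human translation. This is not LALITA's actual feature set, which is not described on this page: the `complexity_score` function, the `CLAUSE_MARKERS` word list, and the weight of 2.0 are all hypothetical choices made only for the example.

```python
import re

# Hypothetical list of words that often introduce a subordinate clause.
# An assumption for illustration, not a feature taken from the paper.
CLAUSE_MARKERS = {"that", "which", "because", "although", "while", "if", "when"}


def complexity_score(sentence: str) -> float:
    """Crude complexity proxy: token count plus a bonus per clause marker."""
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    markers = sum(1 for token in tokens if token in CLAUSE_MARKERS)
    # The weight 2.0 is arbitrary and chosen only for this sketch.
    return len(tokens) + 2.0 * markers


def select_sources(sentences: list[str], budget: int) -> list[str]:
    """Return the `budget` highest-scoring source sentences, best first."""
    ranked = sorted(sentences, key=complexity_score, reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    candidates = [
        "The cat sat.",
        "She left because it rained while we waited.",
        "Hello.",
    ]
    for sentence in select_sources(candidates, budget=2):
        print(f"{complexity_score(sentence):5.1f}  {sentence}")
```

A real pipeline would presumably replace the proxy with linguistically informed features (for example, from a parser or tagger) and validate the selection by training MT systems on the chosen subset.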

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification