Parallel Corpus Augmentation using Masked Language Models

Vibhuti Kumari; Narayana Murthy Kavi

arXiv:2410.03194·cs.CL·October 7, 2024

Open Access

TL;DR

This paper proposes a parallel corpus augmentation method that uses a multilingual masked language model to generate alternative words in context and sentence embeddings to select sentence pairs that are likely translations of each other. It aims to grow a seed corpus many-fold at good quality without any additional monolingual data.

Contribution

It presents an approach that combines a multilingual masked language model with sentence embeddings to augment parallel corpora. The authors believe it can greatly alleviate data scarcity in machine translation for any language pair with a reasonable seed corpus.

Findings

01 Promises good-quality parallel corpora many times larger than the seed corpus

02 Does not require additional monolingual data

03 Is cross-checked with metrics for MT quality estimation

Abstract

In this paper we propose a novel method of augmenting parallel text corpora that promises good quality and is capable of producing corpora many-fold larger than the seed corpus we start with. We do not need any additional monolingual corpora. We use a Multi-Lingual Masked Language Model to mask and predict alternative words in context, and we use Sentence Embeddings to check and select sentence pairs that are likely to be translations of each other. We cross-check our method using metrics for MT Quality Estimation. We believe this method can greatly alleviate the data scarcity problem for all language pairs for which a reasonable seed corpus is available.
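
The mask, predict, and select loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how candidate sentences are paired or which similarity cutoff is used, so the all-pairs comparison, the cosine threshold of 0.8, and every name here (`variants`, `augment`, `Predictor`, `Embedder`, the `toy_*` stand-ins) are assumptions. The toy predictor and embedder only make the sketch runnable; a real system would plug in a multilingual masked language model and a multilingual sentence encoder.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   Predictor: (tokens with one mask, mask position) -> candidate words
#   Embedder:  sentence -> vector in a shared cross-lingual space
Predictor = Callable[[List[str], int], List[str]]
Embedder = Callable[[str], Sequence[float]]

MASK = "<mask>"


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def variants(sentence: str, predict: Predictor) -> List[str]:
    """Mask each token in turn and substitute every predicted alternative."""
    tokens = sentence.split()
    out = []
    for i, original in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        for word in predict(masked, i):
            if word != original:  # skip the unchanged sentence
                out.append(" ".join(tokens[:i] + [word] + tokens[i + 1:]))
    return out


def augment(src: str, tgt: str, predict_src: Predictor, predict_tgt: Predictor,
            embed: Embedder, threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Keep (source variant, target variant) pairs whose embeddings agree."""
    tgt_scored = [(t, embed(t)) for t in variants(tgt, predict_tgt)]
    kept = []
    for s in variants(src, predict_src):
        es = embed(s)
        for t, et in tgt_scored:
            score = cosine(es, et)
            if score >= threshold:
                kept.append((s, t, score))
    return sorted(kept, key=lambda item: -item[2])


# Toy stand-ins so the sketch runs without any model download.
CONCEPTS = {"the": 0, "el": 0, "sleeps": 1, "duerme": 1,
            "cat": 2, "gato": 2, "dog": 3, "perro": 3, "pez": 4}


def toy_embed(sentence: str) -> List[float]:
    """Bag-of-concepts vector: translations share a dimension."""
    vec = [0.0] * 5
    for word in sentence.split():
        vec[CONCEPTS[word]] += 1.0
    return vec


def toy_predict_en(masked: List[str], i: int) -> List[str]:
    return ["cat", "dog"] if i == 1 else []


def toy_predict_es(masked: List[str], i: int) -> List[str]:
    return ["gato", "perro", "pez"] if i == 1 else []


if __name__ == "__main__":
    for pair in augment("the cat sleeps", "el gato duerme",
                        toy_predict_en, toy_predict_es, toy_embed):
        print(pair)
```

With the toy stand-ins, the seed pair "the cat sleeps" / "el gato duerme" yields the new pair "the dog sleeps" / "el perro duerme", while the mismatched candidate "el pez duerme" falls below the threshold and is dropped. Passing the predictor and embedder in as callables keeps the selection logic independent of any particular model library.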

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems