Parallel Corpus Augmentation using Masked Language Models
Vibhuti Kumari, Narayana Murthy Kavi

TL;DR
This paper introduces a novel parallel corpus augmentation method using multilingual masked language models and sentence embeddings, enabling large-scale, high-quality data expansion without additional monolingual resources.
Contribution
It presents a new approach that combines multilingual masked language models with sentence embeddings to augment parallel corpora, with the aim of alleviating data scarcity in machine translation for language pairs that already have a reasonable seed corpus.
Findings
Produces larger, high-quality parallel corpora
Does not require additional monolingual data
Cross-checks the augmented sentence pairs using MT quality estimation metrics
Abstract
In this paper we propose a novel method of augmenting parallel text corpora which promises good quality and is capable of producing corpora many-fold larger than the seed corpus we start with. We do not need any additional monolingual corpora. We use a multilingual masked language model to mask words and predict alternative words in context, and we use sentence embeddings to check and select sentence pairs which are likely to be translations of each other. We cross-check our method using metrics for MT quality estimation. We believe this method can greatly alleviate the data scarcity problem for all language pairs for which a reasonable seed corpus is available.
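The mask, predict, and filter loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (mask_variants, augment_pair), the 0.8 similarity threshold, the "<mask>" token, and the choice to pair every source variant with every target variant are all assumptions made for the example. The toy predictor and toy embedder stand in for a real multilingual masked language model and a real multilingual sentence encoder, so that the sketch runs without downloading any model.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   PredictFn: masked sentence -> candidate words for the masked slot
#   EmbedFn:   sentence -> vector in a shared cross-lingual space
PredictFn = Callable[[str], List[str]]
EmbedFn = Callable[[str], List[float]]


def mask_variants(sentence: str, predict_fn: PredictFn,
                  mask_token: str = "<mask>") -> List[str]:
    """Mask each word in turn and substitute the predicted alternatives."""
    words = sentence.split()
    variants = []
    for i, original in enumerate(words):
        masked = " ".join(words[:i] + [mask_token] + words[i + 1:])
        for candidate in predict_fn(masked):
            if candidate != original:  # keep only genuinely new sentences
                variants.append(" ".join(words[:i] + [candidate] + words[i + 1:]))
    return variants


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def augment_pair(src: str, tgt: str, predict_fn: PredictFn, embed_fn: EmbedFn,
                 threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Return new (source, target) pairs whose embeddings are similar enough.

    The threshold value is an assumed placeholder, not a value from the paper.
    """
    src_variants = [src] + mask_variants(src, predict_fn)
    tgt_variants = [tgt] + mask_variants(tgt, predict_fn)
    kept = []
    for s in src_variants:
        for t in tgt_variants:
            if (s, t) == (src, tgt):
                continue  # skip the seed pair itself
            if cosine(embed_fn(s), embed_fn(t)) >= threshold:
                kept.append((s, t))
    return kept


# Toy stand-ins so the sketch runs offline (purely illustrative data).
TOY_PREDICTIONS = {
    "the <mask> sleeps": ["cat", "dog"],
    "le <mask> dort": ["chat", "chien"],
}
TOY_CONCEPTS = {"the": 0, "le": 0, "cat": 1, "chat": 1,
                "dog": 2, "chien": 2, "sleeps": 3, "dort": 3}


def toy_predict(masked_sentence: str) -> List[str]:
    """Look up candidates by masked context; unknown contexts yield nothing."""
    return TOY_PREDICTIONS.get(masked_sentence, [])


def toy_embed(sentence: str) -> List[float]:
    """Bag-of-concepts vector: translations of a word share one dimension."""
    vec = [0.0] * 4
    for word in sentence.split():
        vec[TOY_CONCEPTS[word]] += 1.0
    return vec


new_pairs = augment_pair("the cat sleeps", "le chat dort", toy_predict, toy_embed)
print(new_pairs)
```

With the toy data, the mismatched combinations (for example "the dog sleeps" with "le chat dort") fall below the threshold and are discarded, while the consistent substitution on both sides is kept. In practice, toy_predict and toy_embed would be replaced by calls to an actual multilingual masked language model and sentence encoder, and the threshold would need tuning per language pair.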
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
