Data Augmentation to Address Out-of-Vocabulary Problem in Low-Resource Sinhala-English Neural Machine Translation
Aloka Fernando, Surangika Ranathunga

TL;DR
This paper introduces a data augmentation method for Sinhala-English neural machine translation that considers both syntactic and semantic properties to address out-of-vocabulary words, improving translation quality in low-resource settings.
Contribution
It proposes a novel word and phrase replacement-based data augmentation technique that addresses both types of OOV words (rare words and words absent from the training data) while applying both syntactic and semantic constraints.
Findings
Semantic constraints alone yield comparable results to syntactic constraints.
Combining both constraints further improves translation quality.
The method is effective for low-resource languages lacking linguistic tools.
Abstract
Out-of-Vocabulary (OOV) words are a problem for Neural Machine Translation (NMT). OOV refers to words with a low occurrence in the training data, or to those that are absent from the training data. To alleviate this, word or phrase-based Data Augmentation (DA) techniques have been used. However, existing DA techniques have addressed only one of these OOV types and are limited to considering either syntactic constraints or semantic constraints. We present a word and phrase replacement-based DA technique that considers both types of OOV, by augmenting (1) rare words in the existing parallel corpus, and (2) new words from a bilingual dictionary. During augmentation, we consider both syntactic and semantic properties of the words to guarantee fluency in the synthetic sentences. This technique was evaluated on the low-resource Sinhala-English language pair. We observe with only semantic constraints in…
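To make the replacement idea in the abstract concrete, here is a minimal Python sketch of word-level augmentation under a syntactic check (same part-of-speech tag) and a semantic check (similarity above a threshold). It is an illustration only, not the authors' implementation: the function `augment`, the toy tokens, the POS table, the similarity scores, and the 0.5 threshold are all hypothetical, and the sketch assumes word alignments and target-side POS tags are already available.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[List[str], List[str]]  # (source tokens, target tokens)

def augment(pair: Pair,
            alignment: List[Tuple[int, int]],
            candidates: List[Tuple[str, str]],
            pos: Dict[str, str],
            similarity: Callable[[str, str], float],
            min_sim: float = 0.5) -> List[Pair]:
    """Create synthetic pairs by swapping one aligned word pair at a time.

    A candidate (new_src, new_tgt) replaces the aligned words only if the
    target words share a POS tag (syntactic check) and are similar enough
    (semantic check). Both checks are assumptions made for this sketch.
    """
    src, tgt = pair
    synthetic = []
    for i, j in alignment:
        for new_src, new_tgt in candidates:
            if new_tgt == tgt[j]:
                continue  # nothing to replace
            same_pos = pos.get(new_tgt) == pos.get(tgt[j])
            close = similarity(new_tgt, tgt[j]) >= min_sim
            if same_pos and close:
                synthetic.append((src[:i] + [new_src] + src[i + 1:],
                                  tgt[:j] + [new_tgt] + tgt[j + 1:]))
    return synthetic

# Toy data: placeholder source tokens ("s_*") stand in for Sinhala words.
POS = {"dog": "NOUN", "cat": "NOUN", "table": "NOUN", "run": "VERB"}
SIM = {("cat", "dog"): 0.8, ("table", "dog"): 0.1, ("run", "dog"): 0.2}

def toy_similarity(a: str, b: str) -> float:
    # Symmetric lookup in a hand-written table; a real system might use embeddings.
    return SIM.get((a, b), SIM.get((b, a), 0.0))

pair = (["s_the", "s_dog", "s_sleeps"], ["the", "dog", "sleeps"])
candidates = [("s_cat", "cat"), ("s_run", "run"), ("s_table", "table")]
augmented = augment(pair, [(1, 1)], candidates, POS, toy_similarity)
print(augmented)
```

In this toy run, "run" fails the syntactic check and "table" fails the semantic check, so only the "cat" candidate produces a synthetic sentence pair. The candidate list could hold either rare words from the existing corpus or new entries from a bilingual dictionary, matching the two OOV sources named in the abstract.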
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
