Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation
Zhong Zhou, Alex Waibel

TL;DR
This paper introduces a massively parallel, lexiconized iterative pretraining method for translating severely low resource languages using multiple source languages and a small amount of target data, achieving significant BLEU score improvements.
Contribution
It proposes a novel systematic approach to select source languages and a new transformer model for low resource translation, with effective lexicon integration and translation combination techniques.
Findings
+23.9 BLEU over baseline in Bible translation
42.8 BLEU score for Portuguese-English medical translation
Effective translation for severely low resource Mayan language
Abstract
We translate a closed text that is known in advance into a severely low resource language by leveraging massive source parallelism. In other words, given a text in 124 source languages, we translate it into a severely low resource language using only ~1,000 lines of low resource data without any external help. Firstly, we propose a systematic method to rank and choose source languages that are close to the low resource language. We call the linguistic definition of language family Family of Origin (FAMO), and we call the empirical definition of higher-ranked languages using our metrics Family of Choice (FAMC). Secondly, we build an Iteratively Pretrained Multilingual Order-preserving Lexiconized Transformer (IPML) to train on ~1,000 lines (~3.5%) of low resource data. To translate named entities correctly, we build a massive lexicon table for 2,939 Bible named entities in 124 source…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Machine Learning in Bioinformatics
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Adam · Dropout · Layer Normalization
