Central Yup'ik and Machine Translation of Low-Resource Polysynthetic Languages
Christopher Liu, Laura Dominé, Kevin Chavez, Richard Socher

TL;DR
This paper builds a Yup'ik-to-English neural machine translation system for Yup'ik, a low-resource polysynthetic language. The work covers parallel corpus creation, a rule-based morphological parser, and a comparison of tokenization methods by their effect on translation accuracy.
Contribution
It introduces a Yup'ik-English parallel corpus and a rule-based morphological parser, and it compares tokenization techniques for Yup'ik-to-English translation, advancing machine translation for low-resource languages.
Findings
Tokenized input improves translation accuracy over unparsed input
BPE performs best at a reduced vocabulary size in initial experiments
Morfessor achieves highest BLEU at 30k vocabulary
Abstract
Machine translation tools do not yet exist for the Yup'ik language, a polysynthetic language spoken by around 8,000 people who live primarily in Southwest Alaska. We compiled a parallel text corpus for Yup'ik and English and developed a morphological parser for Yup'ik based on grammar rules. We trained a seq2seq neural machine translation model with attention to translate Yup'ik input into English. We then compared the influence of different tokenization methods, namely rule-based, unsupervised (byte pair encoding), and unsupervised morphological (Morfessor) parsing, on BLEU score accuracy for Yup'ik to English translation. We find that using tokenized input increases the translation accuracy compared to that of unparsed input. Although overall Morfessor did best with a vocabulary size of 30k, our first experiments show that BPE performed best with a reduced vocabulary size.
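To make the byte pair encoding (BPE) step mentioned in the abstract more concrete, here is a minimal sketch of how BPE merge rules can be learned and applied. It is a toy illustration under stated assumptions, not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the tie-breaking rule are all hypothetical choices, and the input strings are placeholders rather than Yup'ik data.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Placeholder strings, not real corpus data.
    rules = learn_bpe(["abab", "abab", "abc"], num_merges=1)
    print(rules)
    print(segment("abab", rules))
```

In this sketch the number of merges controls the subword vocabulary size, which is the quantity the abstract varies when it compares a 30k vocabulary against a reduced one.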
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence · Byte Pair Encoding
