Central Yup'ik and Machine Translation of Low-Resource Polysynthetic Languages

Christopher Liu; Laura Dominé; Kevin Chavez; Richard Socher

arXiv:2009.04087 · cs.CL · September 10, 2020 · 1 citation

Open Access

TL;DR

This paper develops a neural machine translation system for the low-resource polysynthetic Yup'ik language, including corpus creation, morphological parsing, and evaluation of tokenization methods, to improve translation accuracy.

Contribution

It introduces a Yup'ik-English parallel corpus and a rule-based morphological parser, and compares tokenization techniques for Yup'ik-to-English translation, advancing MT for low-resource languages.

Findings

01. Tokenized input improves translation accuracy over unparsed input

02. BPE performs best with a reduced vocabulary size

03. Morfessor achieves the highest BLEU overall, at a 30k vocabulary

Abstract

Machine translation tools do not yet exist for the Yup'ik language, a polysynthetic language spoken by around 8,000 people who live primarily in Southwest Alaska. We compiled a parallel text corpus for Yup'ik and English and developed a morphological parser for Yup'ik based on grammar rules. We trained a seq2seq neural machine translation model with attention to translate Yup'ik input into English. We then compared the influence of different tokenization methods, namely rule-based, unsupervised (byte pair encoding), and unsupervised morphological (Morfessor) parsing, on BLEU score accuracy for Yup'ik to English translation. We find that using tokenized input increases the translation accuracy compared to that of unparsed input. Although overall Morfessor did best with a vocabulary size of 30k, our first experiments show that BPE performed best with a reduced vocabulary size.
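
As an illustration of the byte pair encoding (BPE) tokenization mentioned in the abstract, the following is a minimal sketch of the general technique, not the authors' implementation. The function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, and the toy input strings are assumptions made for this example; the toy strings are not Yup'ik data.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (hypothetical helper)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy usage: one merge learned from two copies of a repetitive string.
toy_merges = learn_bpe(["abab", "abab"], num_merges=1)
toy_segments = apply_bpe("abab", toy_merges)
```

The number of merges controls the subword vocabulary size: fewer merges keep segments close to single characters, while more merges produce longer units. This is the knob that a vocabulary-size comparison like the one described in the abstract would vary.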

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence · Byte Pair Encoding