BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages
Manuel Mager, Arturo Oncevay, Elisabeth Mager, Katharina Kann, and Ngoc Thang Vu

TL;DR
This study compares morphological segmentation methods and Byte-Pair Encodings for machine translation of four polysynthetic languages, revealing that unsupervised morphological segmentation often outperforms BPEs in translation quality.
Contribution
The paper compares supervised and unsupervised morphological segmentation methods against BPEs as MT inputs for four polysynthetic languages, and contributes two new morphological segmentation datasets for Raramuri and Shipibo-Konibo along with a parallel corpus.
Findings
An unsupervised morphological segmentation algorithm consistently outperforms BPEs for all language pairs except Nahuatl.
Supervised methods achieve better segmentation scores but underperform when their output is used as input for MT.
Two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, plus a parallel corpus, are provided.
Abstract
Morphologically-rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently and that, although supervised methods achieve better segmentation scores, they under-perform in MT challenges. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a…
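For readers unfamiliar with the BPE baseline mentioned in the abstract, the following is a minimal sketch of how BPE merges are learned and applied. It is a toy illustration only, not the setup used in the paper: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the tie-breaking rule, and the English example words are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy sketch)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low", "low", "lower"], num_merges=2)
print(merges)
print(segment("lowest", merges))
```

Because the merges are driven purely by pair frequency, the resulting subwords need not align with morpheme boundaries, which is the contrast with morphological segmentation that the paper investigates.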
Taxonomy
Topics: Natural Language Processing Techniques
