The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation

Jonne Sälevä; Constantine Lignos

arXiv:2103.11189·cs.CL·May 17, 2024

TL;DR

This study compares subword segmentation methods in low-resource neural machine translation, finding no consistent advantage of morphologically-based methods over BPE across different language pairs.

Contribution

The paper provides an empirical evaluation of morphologically-based versus BPE segmentation methods in low-resource NMT and finds that they perform comparably.

Findings

01. Morphologically-based methods outperform BPE in some cases.

02. No consistent performance difference between segmentation methods.

03. Segmentation method performance varies across language pairs.

Abstract

This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting. We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL. We evaluate translation tasks between English and each of Nepali, Sinhala, and Kazakh, and predict that using morphologically-based segmentation methods would lead to better performance in this setting. However, comparing to BPE, we find that no consistent and reliable differences emerge between the segmentation methods. While morphologically-based methods outperform BPE in a few cases, what performs best tends to vary across tasks, and the performance of segmentation methods is often statistically indistinguishable.
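
For readers unfamiliar with the BPE baseline named in the abstract, the following is a minimal sketch of token-level BPE: merges are learned from word frequencies and never cross word boundaries. It is an illustration only, not the authors' implementation, and it does not cover the sentence-level variant or the morphological segmenters (LMVR, MORSEL). The function names `learn_bpe`, `merge_pair`, and `segment`, the `</w>` end-of-word marker, and the toy word counts are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} mapping."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the marker string is an assumption of this sketch).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy usage: hypothetical word counts, not data from the paper.
toy_counts = {"ab": 10, "abc": 1}
toy_merges = learn_bpe(toy_counts, num_merges=2)
toy_segments = {word: segment(word, toy_merges) for word in toy_counts}
```

Because the merges are chosen purely by frequency, the resulting subwords need not align with morpheme boundaries, which is the contrast with morphologically-based segmentation that the paper investigates.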

Taxonomy

Methods: Byte Pair Encoding