How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?
Chantal Amrhein, Rico Sennrich

TL;DR
This paper evaluates how well subword segmentation strategies handle non-concatenative morphology in machine translation. It finds that such phenomena remain challenging and recommends testing new text representation strategies on typologically diverse languages.
Contribution
It introduces a test suite for assessing segmentation strategies on non-concatenative morphological phenomena in a controlled setting.
Findings
Subword segmentation struggles with non-concatenative phenomena like reduplication and vowel harmony.
Character-level models perform better on complex morphological phenomena.
Rare word stems remain challenging for current segmentation strategies.
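To make the contrast behind these findings concrete, here is a minimal Python sketch of three toy word-formation rules. It is not the authors' test suite: the function names, the nonce stems, the suffix "ta", and the two-way "lar"/"ler" alternation (loosely modelled on Turkish-style plural harmony) are all assumptions chosen for illustration. The point it shows is that a concatenative affix is one fixed string a subword vocabulary can memorise, whereas reduplication and vowel harmony produce material that depends on the stem itself.

```python
# Toy illustration only: hypothetical rules and nonce stems, not the paper's data.

BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")


def concatenative_plural(stem: str) -> str:
    """Concatenative: the same suffix string is attached to every stem."""
    return stem + "ta"  # "ta" is a made-up suffix


def full_reduplication(stem: str) -> str:
    """Non-concatenative: the 'affix' is a copy of the stem, so it differs per stem."""
    return stem + "-" + stem


def harmonic_plural(stem: str) -> str:
    """Vowel harmony: the suffix vowel agrees with the last vowel of the stem."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return stem + "lar"
        if ch in FRONT_VOWELS:
            return stem + "ler"
    return stem + "ler"  # arbitrary default for stems without a listed vowel


if __name__ == "__main__":
    for stem in ["tola", "mire"]:
        print(stem, concatenative_plural(stem), full_reduplication(stem), harmonic_plural(stem))
```

With a fixed suffix, a model can learn a single output unit. With reduplication or harmony, it has to copy or inspect the stem, which is harder when the stem is rare and split into unfamiliar subword pieces.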
Abstract
Data-driven subword segmentation has become the default strategy for open-vocabulary machine translation and other NLP tasks, but may not be sufficiently generic for optimal learning of non-concatenative morphology. We design a test suite to evaluate segmentation strategies on different types of morphological phenomena in a controlled, semi-synthetic setting. In our experiments, we compare how well machine translation models trained on subword- and character-level can translate these morphological phenomena. We find that learning to analyse and generate morphologically complex surface representations is still challenging, especially for non-concatenative morphological phenomena like reduplication or vowel harmony and for rare word stems. Based on our results, we recommend that novel text representation strategies be tested on a range of typologically diverse languages to minimise the…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
