Morphological and Language-Agnostic Word Segmentation for NMT
Dominik Macháček, Jonáš Vidra, Ondřej Bojar

TL;DR
This paper compares linguistically uninformed and linguistically motivated subword segmentation methods for neural machine translation. The uninformed methods currently perform better, and a simple preprocessing step improves BPE results.
Contribution
It introduces a novel derivational dictionary-based segmentation method and demonstrates how preprocessing can enhance BPE performance in NMT.
Findings
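The linguistically uninformed methods (BPE and STE) outperform the linguistically motivated ones (Morfessor and the derivational-dictionary method) in German-to-Czech translation.
A simple preprocessing step for BPE considerably improves translation quality as measured by automatic metrics.
The paper identifies a critical difference between BPE and STE.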
Abstract
The state of the art of handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits given by the NMT model and GPU memory capacity. In this paper, we compare two common but linguistically uninformed methods of subword construction (BPE and STE, the method implemented in the Tensor2Tensor toolkit) and two linguistically motivated methods: Morfessor and one novel method, based on a derivational dictionary. Our experiments with German-to-Czech translation, both morphologically rich, document that so far, the non-motivated methods perform better. Furthermore, we identify a critical difference between BPE and STE and show a simple preprocessing step for BPE that considerably increases translation quality as evaluated by automatic measures.
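To make the idea of breaking word forms into subword units concrete, here is a minimal sketch of generic byte-pair encoding (BPE) in Python. It is an illustration only, not the paper's setup or its preprocessing step. The function names (`learn_bpe`, `merge_pair`, `segment`), the toy English corpus, and the lexicographic tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} dict."""
    # Every word starts as a sequence of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an arbitrary choice made here only to keep the sketch deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split `word` into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical word counts, for illustration only).
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(toy_corpus, num_merges=4)
print(merges)
print(segment("lowest", merges))
```

With these four merges, the unseen word "lowest" is segmented into "low" and "est". The number of merges controls the size of the subword vocabulary, which is how such methods keep the vocabulary within the limits imposed by the NMT model and GPU memory.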
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Byte Pair Encoding
