Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English
Duygu Ataman, Matteo Negri, Marco Turchi, Marcello Federico

TL;DR
This paper introduces a linguistically motivated vocabulary reduction method for neural machine translation that leverages morphological analysis to improve translation accuracy for morphologically rich languages like Turkish.
Contribution
The paper proposes a new vocabulary reduction approach based on unsupervised morphology learning and supervised analysis, aimed at improving translation quality in NMT systems for morphologically rich languages.
Findings
Achieves a 2.3 BLEU point improvement over conventional methods.
Reduces vocabulary size while preserving the morphological and semantic integrity of words.
Demonstrates better translation accuracy for Turkish-to-English NMT.
Abstract
The necessity of using a fixed-size word vocabulary in order to control the model complexity in state-of-the-art neural machine translation (NMT) systems is an important bottleneck on performance, especially for morphologically rich languages. Conventional methods that aim to overcome this problem by using sub-word or character-level representations rely solely on statistics and disregard the linguistic properties of words, which leads to interruptions in the word structure and causes semantic and syntactic losses. In this paper, we propose a new vocabulary reduction method for NMT, which can reduce the vocabulary of a given input corpus at any rate while also considering the morphological properties of the language. Our method is based on unsupervised morphology learning and can, in principle, be used for pre-processing any language pair. We also present an alternative word…
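To make the fixed-vocabulary bottleneck concrete, here is a minimal sketch of why splitting words along morpheme boundaries helps with unseen inflected forms. It is not the authors' method: the segmentation table, the `+` suffix marker, and the helper names (`segment`, `build_vocab`, `is_covered`) are hypothetical. The table is written by hand for illustration, whereas the paper learns morphology without supervision.

```python
# Hypothetical hand-written segmentations, for illustration only.
# A real system would obtain these from a learned morphology model.
TOY_SEGMENTATIONS = {
    "evler": ["ev", "+ler"],                 # houses
    "evde": ["ev", "+de"],                   # in the house
    "evlerde": ["ev", "+ler", "+de"],        # in the houses
    "kitaplar": ["kitap", "+lar"],           # books
    "kitaplarda": ["kitap", "+lar", "+da"],  # in the books
    "okulda": ["okul", "+da"],               # at school
}


def segment(word):
    """Return the toy morpheme split for a word, or the word itself."""
    return TOY_SEGMENTATIONS.get(word, [word])


def build_vocab(words, use_segments):
    """Collect the set of units (whole words or morphemes) seen in training."""
    vocab = set()
    for word in words:
        vocab.update(segment(word) if use_segments else [word])
    return vocab


def is_covered(word, vocab, use_segments):
    """True if every unit of the word is in the vocabulary."""
    units = segment(word) if use_segments else [word]
    return all(unit in vocab for unit in units)


train = ["evler", "evde", "kitaplar", "okulda"]
unseen = ["evlerde", "kitaplarda"]

word_vocab = build_vocab(train, use_segments=False)
morph_vocab = build_vocab(train, use_segments=True)

for word in unseen:
    print(word,
          "word-level:", is_covered(word, word_vocab, False),
          "morpheme-level:", is_covered(word, morph_vocab, True))
```

In this toy setting both unseen forms are out of vocabulary at the word level but fully covered once words are represented as stems plus suffixes, because the pieces were already observed in other words. Over a large corpus, the same reuse of stems and suffixes is what lets a morpheme inventory stay far smaller than the set of surface word forms.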
