How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?
Ali Araabi, Christof Monz, Vlad Niculae

TL;DR
This paper evaluates how effectively Byte Pair Encoding (BPE) helps neural machine translation systems handle out-of-vocabulary (OOV) words. It finds that BPE is useful but still mistranslates many OOV words, with better results for named entities and for linguistically similar language pairs.
Contribution
The study provides an explicit analysis of BPE's success in translating OOV words at the word level, highlighting its limitations and conditions for improved performance.
Findings
BPE is fairly useful but often translates OOV words incorrectly.
BPE performs better for named entities and for linguistically similar language pairs.
A significant percentage of OOV words are still mistranslated despite BPE use.
Abstract
Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with words that do not occur during training (a.k.a. out-of-vocabulary (OOV) words) has long been a fundamental challenge for NMT systems. The predominant method to tackle this problem is Byte Pair Encoding (BPE), which splits words, including OOV words, into sub-word segments. BPE has achieved impressive results for a wide range of translation tasks in terms of automatic evaluation metrics. While it is often assumed that by using BPE, NMT systems are capable of handling OOV words, the effectiveness of BPE in translating OOV words has not been explicitly measured. In this paper, we study to what extent BPE is successful in translating OOV words at the word level. We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment…
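To make the segmentation step in the abstract concrete, here is a minimal sketch of how BPE merges can be learned from training words and then applied to a word that never occurred in training. It is an illustration only, not the authors' implementation: the function names (learn_bpe, merge_pair, segment), the end-of-word marker "</w>", and the toy word counts are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Start from characters, with a hypothetical end-of-word marker appended.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a word (seen or unseen) by replaying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy training counts (made up for illustration).
train_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(train_counts, num_merges=10)

# "lowest" is OOV at the word level, yet it can still be segmented.
oov_word = "lowest"
print(oov_word in train_counts, segment(oov_word, merges))
```

The sketch shows why BPE never produces an unknown token: an unseen word always falls back to smaller segments, in the worst case single characters. Whether the translation model then renders those segments as a correct target word is a separate question, and that is what the paper measures.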
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Byte Pair Encoding
