A Call for Prudent Choice of Subword Merge Operations in Neural Machine Translation
Shuoyang Ding, Adithya Renduchintala, Kevin Duh

TL;DR
This paper systematically investigates how the number of BPE merge operations affects neural machine translation performance across different architectures and language pairs, providing guidance for optimal subword segmentation choices.
Contribution
It offers a comprehensive analysis of BPE merge operation effects, highlighting architecture-dependent optimal configurations and emphasizing the importance of careful BPE selection.
Findings
For LSTM-based architectures, there is no single optimal BPE size, so a wide range of merge operation counts needs to be tried.
For Transformer architectures, smaller BPE sizes tend to be the better choice.
Sub-optimal BPE choices can reduce BLEU scores by 3-4 points.
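To make the tuned quantity concrete, here is a minimal sketch of the standard BPE learning loop, in which the number of merge operations is the only hyperparameter. It is an illustration under stated assumptions, not the authors' experimental pipeline: the names `learn_bpe` and `avg_tokens_per_word` are hypothetical, and the toy corpus is invented for the example.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} dict (illustrative sketch)."""
    # Start from characters, with an end-of-word marker appended to each word.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


def avg_tokens_per_word(vocab):
    """Frequency-weighted average number of subword tokens per word."""
    total_tokens = sum(len(symbols) * freq for symbols, freq in vocab.items())
    total_words = sum(vocab.values())
    return total_tokens / total_words


# Toy corpus (made up for illustration): sweep the number of merge operations.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
for n in (0, 3, 10, 1000):
    merges, vocab = learn_bpe(toy_corpus, n)
    print(n, len(merges), round(avg_tokens_per_word(vocab), 2))
```

With zero merges the segmentation is purely character-level. With enough merges every word collapses into a single symbol. The sweep the paper calls for explores the range between those two extremes.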
Abstract
Most neural machine translation systems are built upon subword units extracted by methods such as Byte-Pair Encoding (BPE) or wordpiece. However, the choice of number of merge operations is generally made by following existing recipes. In this paper, we conduct a systematic exploration on different numbers of BPE merge operations to understand how it interacts with the model architecture, the strategy to build vocabularies and the language pair. Our exploration could provide guidance for selecting proper BPE configurations in the future. Most prominently: we show that for LSTM-based architectures, it is necessary to experiment with a wide range of different BPE operations as there is no typical optimal BPE configuration, whereas for Transformer architectures, smaller BPE size tends to be a typically optimal choice. We urge the community to make prudent choices with subword merge…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Dense Connections · Label Smoothing · Adam · Softmax · Dropout
