Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
Saumitra Yadav, Manish Shrivastava

TL;DR
This paper shows that asymmetric Byte Pair Encoding (BPE), which uses a different number of merge operations for the source and target languages, improves machine translation performance over traditional symmetric BPE, especially in low-resource scenarios.
Contribution
It introduces asymmetric BPE segmentation, in which the source and target languages use different numbers of merge operations (NMOs), and reports significant improvements across multiple language pairs and data sizes.
Findings
Asymmetric BPE yields statistically significant gains in translation quality.
The optimal NMO differs between the two sides: high for the source language and low for the target language.
Low-resource MT benefits most from asymmetric BPE segmentation.
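The difference between symmetric and asymmetric segmentation can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy corpora, the function names (learn_bpe, merge_pair, segment), and the NMO values below are all assumptions chosen for illustration. The sketch learns a separate BPE merge table per language and gives the source side a larger merge budget than the target side.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn at most `num_merges` merges (the NMO) from a list of sentences."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in corpus:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment one word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpora standing in for the two sides of a parallel corpus.
src_corpus = ["low lower lowest", "new newer newest", "low new"]
tgt_corpus = ["bas plus bas", "neuf plus neuf", "bas neuf"]

# Asymmetric setup: a different NMO per side (illustrative values, not from the paper).
SRC_NMO, TGT_NMO = 8, 2
src_merges = learn_bpe(src_corpus, SRC_NMO)
tgt_merges = learn_bpe(tgt_corpus, TGT_NMO)

print(segment("lowest", src_merges))
print(segment("neuf", tgt_merges))
```

A symmetric baseline would pass the same value as both SRC_NMO and TGT_NMO. In practice the two merge tables would be learned from the full training corpus with an established BPE tool, and the NMO pair would be chosen by comparing translation quality on held-out data.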
Abstract
Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models: symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) when training tokenizers for both source and target languages. However, we demonstrate that this uniform approach does not guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that utilizing asymmetric BPE, where the source and target languages have different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant average gains of 5.32, 4.46, and 0.7 CHRF++ on…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
