TL;DR
This paper introduces BPE-dropout, a simple regularization method that stochastically corrupts the BPE segmentation procedure so that the same word can be segmented in multiple ways, improving machine translation quality by up to 3 BLEU points.
Contribution
It demonstrates that BPE can inherently produce multiple segmentations and proposes BPE-dropout to enhance model robustness and translation performance.
Findings
Improves translation quality by up to 3 BLEU points.
Enables BPE to produce multiple segmentations.
Compatible with standard BPE during inference.
Abstract
Subword segmentation is widely used to address the open vocabulary problem in machine translation. The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the rare ones into multiple tokens. While multiple segmentations are possible even with the same vocabulary, BPE splits words into unique sequences; this may prevent a model from better learning the compositionality of words and being robust to segmentation errors. So far, the only way to overcome this BPE imperfection, its deterministic nature, was to create another subword segmentation algorithm (Kudo, 2018). In contrast, we show that BPE itself incorporates the ability to produce multiple segmentations of the same word. We introduce BPE-dropout, a simple and effective subword regularization method based on and compatible with conventional BPE. It…
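The abstract excerpt is cut off before it describes the procedure, so what follows is only a minimal sketch. It assumes the stochastic corruption works by skipping each candidate merge with probability `dropout` at every merge step. The merge table `MERGES`, the function name `bpe_segment`, and its parameters are hypothetical and exist for illustration only; this is not the authors' implementation.

```python
import random

# Toy merge table in priority order (earlier = higher priority).
# Hypothetical data; a real table would be learned from a corpus.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def bpe_segment(word, merges, dropout=0.0, rng=random):
    """Segment `word` with a merge table, skipping merges at random.

    At each step, every applicable merge is dropped with probability
    `dropout`; the highest-priority surviving merge is applied.
    Segmentation stops when no merge survives.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Adjacent pairs that are in the table and survive dropout.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Lowest rank wins; ties are broken by leftmost position.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# dropout=0.0 reduces to deterministic BPE on this toy table.
print(bpe_segment("lower", MERGES))

# With dropout > 0, repeated calls may yield different segmentations.
rng = random.Random(0)
for _ in range(3):
    print(bpe_segment("lower", MERGES, dropout=0.3, rng=rng))
```

Under this assumption, setting `dropout=0.0` recovers a single deterministic segmentation, which is consistent with the stated compatibility with standard BPE at inference, while `dropout=1.0` falls back to individual characters.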
Taxonomy
Methods: Byte Pair Encoding
