LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation
Keita Nonaka, Kazutaka Yamanouchi, Tomohiro I, Tsuyoshi Okita, Kazutaka Shimada, Hiroshi Sakamoto

TL;DR
This paper introduces LCP-dropout, a probabilistic subword segmentation method based on compression algorithms, which enhances neural machine translation performance, especially with limited training data.
Contribution
It presents a novel probabilistic approach called LCP-dropout that improves upon BPE/BPE-dropout for multiple subword segmentation in neural machine translation.
Findings
Outperforms baseline methods in small data scenarios
Enhances subword segmentation quality
Improves translation accuracy
Abstract
In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in Neural Machine Translation. Among these methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches. However, the compression-based approach has a drawback: because it is deterministic, generating multiple segmentations is difficult. To overcome this difficulty, we focus on a probabilistic string algorithm, called locally-consistent parsing (LCP), that has been applied to achieve optimum compression. Employing the probabilistic mechanism of LCP, we propose LCP-dropout for multiple subword segmentation that improves BPE/BPE-dropout, and show that it outperforms various baselines in learning…
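To make the determinism problem concrete, here is a minimal sketch of merge dropout in the spirit of BPE-dropout, the baseline the abstract names. It is not the paper's LCP-dropout algorithm. The function name `segment`, the toy merge table `merges`, and the `dropout` parameter are all hypothetical and chosen only for illustration.

```python
import random


def segment(word, merges, dropout=0.0, rng=None):
    """Apply ranked merges to `word`, skipping each candidate merge
    with probability `dropout` (hypothetical helper, illustration only)."""
    rng = rng or random.Random()
    # Lower rank means the merge has higher priority.
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while True:
        # Adjacent pairs that are in the merge table and survive dropout this round.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            return tokens
        # Merge the highest-priority surviving pair (leftmost on ties).
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


# Toy merge table (an assumption, not learned from any corpus).
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

print(segment("lower", merges))               # dropout 0: always the same segmentation
print(segment("lower", merges, dropout=0.5))  # may differ from call to call
```

With `dropout=0.0` every call returns the same segmentation, which is the determinism the abstract describes. A nonzero `dropout` lets the same word yield different segmentations across calls, so the translation model sees several subword views of the same training sentence.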
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Bioinformatics
