LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation

Keita Nonaka; Kazutaka Yamanouchi; Tomohiro I; Tsuyoshi Okita; Kazutaka Shimada; Hiroshi Sakamoto

arXiv:2202.13590·cs.CL·March 2, 2023

Open Access

TL;DR

This paper introduces LCP-dropout, a probabilistic subword segmentation method based on compression algorithms, which enhances neural machine translation performance, especially with limited training data.

Contribution

It presents a novel probabilistic approach called LCP-dropout that improves upon BPE/BPE-dropout for multiple subword segmentation in neural machine translation.

Findings

01 Outperforms baseline methods in small data scenarios

02 Enhances subword segmentation quality

03 Improves translation accuracy

Abstract

In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in Neural Machine Translation. Among such methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches. However, the compression-based approach has a drawback: generating multiple segmentations is difficult because the algorithm is deterministic. To overcome this difficulty, we focus on a probabilistic string algorithm called locally-consistent parsing (LCP), which has been applied to achieve optimum compression. Employing the probabilistic mechanism of LCP, we propose LCP-dropout for multiple subword segmentation, which improves on BPE/BPE-dropout, and show that it outperforms various baselines in learning…
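To make the determinism issue concrete, here is a minimal sketch of BPE-style segmentation with an optional dropout probability, in the spirit of BPE-dropout. It is not the paper's LCP-dropout algorithm, which the abstract does not specify. The merge table, the function name `segment`, and the `dropout` parameter are hypothetical and chosen only for illustration.

```python
import random


def segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` with ranked BPE-style merges.

    Each applicable merge is skipped with probability `dropout` at every
    step, so dropout=0.0 gives the usual deterministic segmentation.
    Illustrative sketch only; not the LCP-dropout procedure.
    """
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the highest-priority (lowest-rank) surviving merge.
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Hypothetical merge table, ordered by priority (assumption, not from the paper).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

deterministic = segment("lower", MERGES)               # same result on every call
sampled = segment("lower", MERGES, dropout=0.5,
                  rng=random.Random(0))                # varies with the random state
```

Without dropout, a given word always maps to a single segmentation. With dropout, repeated calls can yield different segmentations of the same word, which is the kind of multiple segmentation the abstract refers to.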

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Bioinformatics