SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation
Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi, and Eiichiro Sumita

TL;DR
SelfSeg is a fast, self-supervised sub-word segmentation method for neural machine translation. It requires only monolingual dictionaries, selects segmentations with dynamic programming, and outperforms existing methods, especially in low-resource scenarios.
Contribution
Introduces SelfSeg, a self-supervised neural sub-word segmentation approach that is more efficient and effective than prior methods, requires only monolingual dictionaries rather than parallel corpora, and can produce diverse segmentations.
Findings
SelfSeg improves BLEU scores by over 1.2 on low-resource datasets.
Regularization increases segmentation diversity and improves BLEU scores by approximately 4.3.
SelfSeg achieves competitive results across multiple translation datasets.
Abstract
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE); however, they are inefficient, as they require parallel corpora, days to train, and hours to decode. This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train and decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability, and generates the segmentation with the maximum posterior probability, which is calculated using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word frequency normalization strategies to accelerate the training phase.…
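The abstract's "segmentation with the maximum posterior probability, calculated using a dynamic programming algorithm" can be illustrated with a generic Viterbi-style search over split points. This is a minimal sketch, not the authors' implementation: it assumes each candidate sub-word has a context-independent log-probability, whereas SelfSeg scores pieces with a neural model over a partially masked character sequence. The names best_segmentation, toy_log_prob, TOY_SCORES, and max_len are hypothetical, and the scores are made up for illustration.

```python
import math
from typing import Callable, List, Tuple


def best_segmentation(
    word: str,
    log_prob: Callable[[str], float],
    max_len: int = 8,
) -> Tuple[List[str], float]:
    """Return the highest-scoring split of `word` and its total log-probability.

    Assumption: the score of a segmentation is the sum of independent
    per-piece log-probabilities (a simplification of a neural scorer).
    """
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + log_prob(word[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    pieces.reverse()
    return pieces, best[n]


# Hypothetical toy scores standing in for a trained model.
TOY_SCORES = {"un": -1.0, "lock": -1.5, "ed": -1.0, "unlock": -4.0, "locked": -3.0}


def toy_log_prob(piece: str) -> float:
    # Unknown pieces get a length-proportional penalty so every word stays segmentable.
    return TOY_SCORES.get(piece, -5.0 * len(piece))


if __name__ == "__main__":
    print(best_segmentation("unlocked", toy_log_prob))
```

With these toy scores, the search prefers un + lock + ed (total -3.5) over un + locked (-4.0) and unlock + ed (-5.0). The table lookup runs in O(n · max_len) scorer calls per word, which is why a dynamic program is practical here even though the number of possible segmentations grows exponentially with word length.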
Taxonomy
Methods: SentencePiece · Byte Pair Encoding
