Distributional Properties of Subword Regularization

Marco Cognetta; Vilém Zouhar; Naoaki Okazaki

arXiv:2408.11443·cs.CL·August 22, 2024

TL;DR

This paper analyzes the distributions produced by stochastic subword regularization methods (the dropout variants of BPE and MaxMatch), shows that they are heavily biased towards a small set of tokenizations per word, and proposes a uniform sampling algorithm that improves machine translation quality.

Contribution

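The paper shows that existing stochastic subword regularization schemes produce heavily biased distributions over tokenizations, and it introduces a uniform sampling algorithm, used as a drop-in replacement for their stochastic components, that improves machine translation quality.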

Findings

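01. Stochastic BPE and MaxMatch variants are heavily biased towards a small set of tokenizations per word

02. Uniformly sampling tokenizations improves machine translation quality

03. The proposed sampler is a drop-in replacement for the stochastic aspects of existing tokenizers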

Abstract

Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
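To make the idea of uniformly sampling tokenizations concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a tokenization is any split of a word into pieces that all belong to a fixed vocabulary, and it uses a plain dynamic program (count the segmentations of each suffix, then sample each piece in proportion to the number of completions it allows). The function names and the toy vocabulary are hypothetical and chosen only for illustration.

```python
import random


def count_suffix_tokenizations(word, vocab):
    """Return counts where counts[i] is the number of ways to split
    word[i:] into pieces that are all in vocab (illustrative helper)."""
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                counts[i] += counts[j]
    return counts


def sample_uniform_tokenization(word, vocab, rng=random):
    """Draw one tokenization of word uniformly at random.

    Choosing the next piece word[i:j] with probability
    counts[j] / counts[i] makes the product over all steps telescope
    to 1 / counts[0], so every valid tokenization is equally likely.
    """
    counts = count_suffix_tokenizations(word, vocab)
    if counts[0] == 0:
        raise ValueError(f"{word!r} cannot be segmented with this vocabulary")
    tokens, i, n = [], 0, len(word)
    while i < n:
        # Candidate end positions that still lead to a complete segmentation.
        ends = [j for j in range(i + 1, n + 1)
                if word[i:j] in vocab and counts[j] > 0]
        weights = [counts[j] for j in ends]
        j = rng.choices(ends, weights=weights, k=1)[0]
        tokens.append(word[i:j])
        i = j
    return tokens


if __name__ == "__main__":
    # Toy vocabulary, assumed for illustration only.
    toy_vocab = {"u", "n", "d", "o", "un", "do", "undo"}
    print(sample_uniform_tokenization("undo", toy_vocab))
```

With the toy vocabulary above, "undo" has five valid tokenizations, and repeated calls return each of them with probability 1/5. A dropout-style scheme, by contrast, induces whatever distribution its merge or matching procedure happens to produce, which is the bias the abstract describes.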

Taxonomy

Topics: Speech Recognition and Synthesis · Machine Learning and Algorithms · Digital Filter Design and Implementation

Methods: Sparse Evolutionary Training · Byte Pair Encoding · Dropout