Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP
Ruan Visser, Trienko Grobler, Marcel Dunaiski

TL;DR
This study investigates the impact of applying BPE dropout during pretraining of language models, finding that stochastic tokenization during both pretraining and fine-tuning enhances low-resource NLP performance.
Contribution
The paper demonstrates that applying BPE dropout during pretraining, not just fine-tuning, improves downstream results, especially in low-resource settings.
Findings
The best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning.
Applying BPE dropout only during fine-tuning can underperform deterministic tokenization in small-data scenarios.
The benefits of BPE dropout during pretraining diminish as fine-tuning data increases.
Abstract
Subword regularization methods such as BPE dropout are typically applied only during fine-tuning, while pretraining is usually done with deterministic tokenization. This creates a potential segmentation mismatch between pretraining and fine-tuning. We investigate whether applying BPE dropout during pretraining improves downstream performance in low-resource NLP. We train monolingual and bilingual BERT models on downsampled subsets of English, German, French, Spanish, Kiswahili, and isiXhosa, and evaluate them on XNLI, PAWS-X, PAN-X, and MasakhaNER 2.0. Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings. This disadvantage diminishes as fine-tuning data increases, while the benefits of…
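To make the contrast between deterministic and stochastic tokenization concrete, the following is a minimal sketch of BPE dropout on a single word. It is not the authors' implementation: the merge table `TOY_MERGES` and the function `bpe_encode` are hypothetical names introduced only for illustration. The sketch assumes the usual formulation, in which each candidate merge is skipped with probability `dropout` at every step, so that `dropout=0.0` reproduces ordinary deterministic BPE and `dropout=1.0` falls back to character-level segmentation.

```python
import random

# Hypothetical toy merge table, ordered by priority (earlier = learned first).
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with BPE, skipping each candidate merge with probability `dropout`."""
    rng = rng or random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while len(tokens) > 1:
        # Collect (rank, position) for every adjacent pair that is a known
        # merge and survives this step's dropout draw.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in ranks and not (dropout > 0 and rng.random() < dropout)
        ]
        if not candidates:
            break  # nothing left to merge (or everything was dropped)
        _, i = min(candidates)  # highest-priority surviving merge, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Deterministic tokenization: always the same segmentation.
print(bpe_encode("lower", TOY_MERGES, dropout=0.0))  # ['lower']

# Full dropout: no merge is ever applied.
print(bpe_encode("lower", TOY_MERGES, dropout=1.0))  # ['l', 'o', 'w', 'e', 'r']

# Stochastic tokenization: segmentations vary from call to call.
rng = random.Random(0)
for _ in range(3):
    print(bpe_encode("lower", TOY_MERGES, dropout=0.3, rng=rng))
```

In this framing, the mismatch described in the abstract corresponds to calling the tokenizer with `dropout=0.0` while building pretraining batches and with a nonzero `dropout` only during fine-tuning. The setting the paper reports as typically best would use a nonzero value in both stages.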
