Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models
Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

TL;DR
This paper introduces AdaptBPE, a BPE tokenization method that improves vocabulary adaptation when fine-tuning pretrained language models, yielding better performance on classification and summarization tasks, especially on text with many out-of-vocabulary (OOV) terms.
Contribution
AdaptBPE modifies the BPE initialization step to first apply longest string matching over the added (target-domain) vocabulary before falling back to character-level splitting, which improves tokenization and model performance in domain-specific fine-tuning.
Findings
AdaptBPE improves accuracy on classification tasks by 3.57% over standard BPE.
AdaptBPE improves Rouge-L on summarization tasks by 1.87% over standard BPE.
Human evaluation shows more relevant and faithful summaries with AdaptBPE.
Abstract
In this work, we show a fundamental limitation in vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially append the target domain-specific vocabulary at the end of the PLM vocabulary. This approach leads to a lower priority score and causes sub-optimal tokenization in BPE, which iteratively uses merge rules to tokenize a given text. To mitigate this issue, we propose AdaptBPE, where the BPE tokenization initialization phase is modified to first perform the longest string matching on the added (target) vocabulary before tokenizing at the character level. We perform an extensive evaluation of AdaptBPE versus the standard BPE over various classification and summarization tasks; AdaptBPE improves by 3.57% (in terms of accuracy) and 1.87% (in terms of Rouge-L),…
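The limitation and the proposed fix described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the merge table, the added token "itis", and the helper names `apply_merges` and `longest_match_init` are all hypothetical, and real tokenizers also handle bytes, pre-tokenization, and special tokens. It assumes that the merges needed to build an added token are appended after the pretrained merges and therefore receive worse (higher) ranks.

```python
from typing import Dict, List, Set, Tuple

Merge = Tuple[str, str]


def apply_merges(symbols: List[str], ranks: Dict[Merge, int]) -> List[str]:
    """Standard BPE loop: repeatedly merge the adjacent pair with the lowest rank."""
    symbols = list(symbols)
    while len(symbols) > 1:
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best rank first, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def longest_match_init(word: str, added: Set[str]) -> List[str]:
    """Greedy left-to-right split: take the longest added-vocabulary token
    at each position, otherwise fall back to a single character."""
    max_len = max((len(t) for t in added), default=1)
    out: List[str] = []
    i = 0
    while i < len(word):
        for j in range(min(len(word), i + max_len), i + 1, -1):
            if word[i:j] in added:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])
            i += 1
    return out


# Hypothetical merge table: pretrained merges come first (ranks 0-1);
# merges that would build the added token "itis" are appended (ranks 2-4).
ranks: Dict[Merge, int] = {
    ("t", "h"): 0,
    ("t", "i"): 1,
    ("i", "s"): 2,
    ("i", "t"): 3,
    ("it", "is"): 4,
}
added_vocab = {"itis"}
word = "arthritis"

# Standard BPE: start from characters; the pretrained merge ("t", "i")
# fires before the appended merges, so "itis" can never be assembled.
standard = apply_merges(list(word), ranks)

# Adapted initialization: match added tokens first, then run the same merges.
adapted = apply_merges(longest_match_init(word, added_vocab), ranks)

print(standard)  # ['a', 'r', 'th', 'r', 'i', 'ti', 's']
print(adapted)   # ['a', 'r', 'th', 'r', 'itis']
```

In this toy setting the added token is present in the vocabulary in both runs, but only the adapted initialization actually emits it, because the merge loop never has to rebuild it from characters.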
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Byte Pair Encoding
