Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models

Gunjan Balde; Soumyadeep Roy; Mainack Mondal; Niloy Ganguly

arXiv:2410.03258·cs.CL·April 29, 2025

TL;DR

This paper introduces AdaptBPE, a modified BPE tokenization method that improves vocabulary adaptation when fine-tuning pretrained language models, leading to better performance on classification and summarization tasks, especially on text with a high proportion of out-of-vocabulary (OOV) words.

Contribution

AdaptBPE modifies the BPE initialization phase to first perform longest string matching against the added (target-domain) vocabulary before splitting the remaining text into characters, which improves tokenization and model performance in domain-specific fine-tuning.
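
The initialization step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `longest_match_init`, the greedy left-to-right scan, and the toy vocabulary are all assumptions made here for clarity.

```python
def longest_match_init(word, added_vocab):
    """Split `word` into initial symbols (hypothetical helper).

    Spans that match an entry of `added_vocab` are kept whole, preferring the
    longest match at each position; everything else falls back to characters.
    """
    max_len = max((len(tok) for tok in added_vocab), default=0)
    symbols = []
    i = 0
    while i < len(word):
        match = None
        # Try the longest candidate first, down to length 2.
        for j in range(min(len(word), i + max_len), i + 1, -1):
            if word[i:j] in added_vocab:
                match = word[i:j]
                break
        if match is None:
            match = word[i]  # character-level fallback
        symbols.append(match)
        i += len(match)
    return symbols


# Toy added vocabulary (assumed, for illustration only).
added = {"hyper", "tension", "hypertension"}
print(longest_match_init("hypertension", added))  # ['hypertension']
print(longest_match_init("hypertensive", added))  # ['hyper', 't', 'e', 'n', 's', 'i', 'v', 'e']
```

In this sketch the returned symbols would then be handed to the usual BPE merge loop, so spans not covered by the added vocabulary are still tokenized with the original merge rules.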

Findings

01. On classification tasks, AdaptBPE improves accuracy by 3.57% over standard BPE.

02. On summarization tasks, AdaptBPE improves the Rouge-L score by 1.87% over standard BPE.

03. Human evaluation shows that AdaptBPE yields more relevant and more faithful summaries.
Abstract

In this work, we show a fundamental limitation in vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially append the target domain-specific vocabulary at the end of the PLM vocabulary. This gives the added tokens a lower priority score and causes sub-optimal tokenization in BPE, which iteratively applies merge rules to tokenize a given text. To mitigate this issue, we propose AdaptBPE, where the BPE tokenization initialization phase is modified to first perform the longest string matching on the added (target) vocabulary before tokenizing at the character level. We perform an extensive evaluation of AdaptBPE versus the standard BPE over various classification and summarization tasks; AdaptBPE improves by 3.57% (in terms of accuracy) and 1.87% (in terms of Rouge-L),…
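
The priority problem described in the abstract can be illustrated with a toy example. The sketch below is a simplified BPE merge loop, not the authors' code: the helper `bpe_apply`, the rank table, and the example token "mri" are hypothetical. The merges for the added token are appended after an existing merge, so they receive worse (higher) ranks.

```python
def bpe_apply(symbols, ranks):
    """Simplified BPE: repeatedly merge the adjacent pair with the lowest rank."""
    symbols = list(symbols)
    while len(symbols) > 1:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best-ranked pair, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Rank 0 is an existing merge; ranks 1 and 2 are appended for the added token "mri".
ranks = {("r", "i"): 0, ("m", "r"): 1, ("mr", "i"): 2}

# Character-level start: the older merge fires first and blocks the added token.
print(bpe_apply(list("mri"), ranks))  # ['m', 'ri']

# Starting from a symbol that already matches the added token keeps it whole.
print(bpe_apply(["mri"], ranks))  # ['mri']
```

Because the existing merge ("r", "i") outranks the appended ones, the character-level start never produces "mri", even though that token is in the vocabulary. Matching the added vocabulary before the merge loop sidesteps this ordering issue.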

Code & Models

Repositories

gb-kgp/adaptbpe (Official)

Videos

Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Byte Pair Encoding