BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Pavel Chizhov; Catherine Arnett; Elizaveta Korotkova; Ivan P. Yamshchikov

arXiv:2409.04599·cs.CL·September 10, 2024

TL;DR

This paper introduces Picky BPE, a modified BPE algorithm that refines the vocabulary during tokenizer training. It improves vocabulary efficiency and eliminates under-trained tokens without sacrificing compression, and downstream performance is maintained and in several cases improved.

Contribution

The paper presents a BPE variant that refines the vocabulary during tokenizer training, addressing under-trained tokens while maintaining, and in several cases improving, downstream model performance.

Findings

01 Improves vocabulary efficiency and eliminates under-trained tokens.

02 Maintains or improves downstream task performance.

03 Does not compromise text compression.

Abstract

Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that our method does not reduce the downstream performance, and in several cases improves it.
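
To make the idea of refining the vocabulary during training concrete, the following is a minimal sketch: a toy BPE trainer over a word-frequency dictionary with one extra pruning step after each merge. The function name `train_bpe_with_pruning`, the `threshold` parameter, and the pruning rule itself (drop an intermediate token when the share of its occurrences absorbed by the new merge reaches the threshold) are assumptions made for illustration. They are not taken from the abstract and may differ from the criterion used in the paper or in the pchizhov/picky_bpe repository.

```python
from collections import Counter


def train_bpe_with_pruning(words, num_merges, threshold=1.0):
    """Toy BPE trainer with a hypothetical vocabulary pruning step.

    words: dict mapping a word to its corpus frequency.
    threshold: assumed parameter; a token is pruned when the fraction of its
        occurrences absorbed by the latest merge is >= threshold.
    """
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = {tuple(w): f for w, f in words.items()}
    vocab = {ch for symbols in corpus for ch in symbols}
    removed = []

    for _ in range(num_merges):
        # Count adjacent pairs and individual token frequencies.
        pairs, token_freq = Counter(), Counter()
        for symbols, f in corpus.items():
            for s in symbols:
                token_freq[s] += f
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break

        # Classical BPE step: merge the most frequent adjacent pair.
        (a, b), pair_freq = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)

        new_corpus = {}
        for symbols, f in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + f
        corpus = new_corpus

        # Hypothetical refinement step (an assumption, not the paper's rule):
        # prune a multi-character token if the merge absorbed enough of it.
        # Single characters are kept so that any text stays encodable.
        # With threshold < 1.0 a full implementation would also need to
        # re-segment the remaining standalone occurrences; omitted here.
        for part in dict.fromkeys((a, b)):
            if len(part) > 1 and pair_freq / token_freq[part] >= threshold:
                vocab.discard(part)
                removed.append(part)

    return vocab, removed
```

For example, calling `train_bpe_with_pruning({"the": 10, "then": 5, "cat": 3}, num_merges=2)` first merges `t` and `h` into `th`, then merges `th` and `e` into `the`. Because every occurrence of `th` is absorbed by the second merge, the sketch drops `th` from the vocabulary, which is the kind of intermediate, rarely used token that a refinement step is meant to clean up.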

Code & Models

Repositories

pchizhov/picky_bpe (official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Byte Pair Encoding