BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov

TL;DR
This paper introduces Picky BPE, a modified BPE algorithm that refines the vocabulary during tokenizer training. It improves vocabulary efficiency and eliminates under-trained tokens without sacrificing compression, while matching and in several cases improving downstream performance.
Contribution
The paper presents a BPE variant that refines the vocabulary during tokenizer training, eliminating under-trained tokens while maintaining, and in several cases improving, downstream model performance.
Findings
Improves vocabulary efficiency and eliminates under-trained tokens.
Maintains or improves downstream task performance.
Does not compromise text compression.
Abstract
Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that our method does not reduce the downstream performance, and in several cases improves it.
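To make "vocabulary refinement during tokenizer training" concrete, here is a minimal sketch of the classical BPE merge loop with one extra diagnostic step. The abstract does not state the refinement criterion, so the flagging rule below is an assumption for illustration only and is not the authors' algorithm. It marks a constituent token when the share of its occurrences absorbed by the new merge reaches a hypothetical `threshold`. The function name `train_bpe`, its parameters, and the tie-breaking rule are also illustrative choices.

```python
from collections import Counter


def train_bpe(words, num_merges, threshold=0.9):
    """Classical BPE on a {word: count} dict, plus a hypothetical flagging step.

    Returns (merges, flagged), where flagged lists (token, ratio) pairs for
    tokens that were almost entirely absorbed by a merge. The flagging rule is
    an assumption for illustration, not the criterion used in the paper.
    """
    # Represent every word as a tuple of symbols, starting from characters.
    corpus = {tuple(word): count for word, count in words.items()}
    merges, flagged = [], []

    for _ in range(num_merges):
        pair_freq, token_freq = Counter(), Counter()
        for symbols, count in corpus.items():
            for symbol in symbols:
                token_freq[symbol] += count
            for left, right in zip(symbols, symbols[1:]):
                pair_freq[(left, right)] += count
        if not pair_freq:
            break  # every word is already a single token

        # Most frequent pair; ties go to the lexicographically larger pair
        # (an arbitrary but deterministic choice).
        (left, right), freq = max(pair_freq.items(), key=lambda kv: (kv[1], kv[0]))
        merges.append((left, right))

        # Hypothetical refinement signal: a constituent that rarely occurs
        # outside the merged pair is a candidate for removal.
        for token in sorted({left, right}):
            ratio = freq / token_freq[token]
            if ratio >= threshold:
                flagged.append((token, ratio))

        # Apply the merge to every word, scanning left to right.
        merged_corpus = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                    out.append(left + right)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_corpus[key] = merged_corpus.get(key, 0) + count
        corpus = merged_corpus

    return merges, flagged


if __name__ == "__main__":
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, flagged = train_bpe(toy_counts, num_merges=3)
    print(merges)
    print(flagged)
```

This sketch only reports candidates. An actual refinement step would also have to remove the token from the vocabulary and re-tokenize its remaining occurrences, which the abstract does not describe.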
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Byte Pair Encoding
