Batching BPE Tokenization Merges

Alexander P. Morgan

arXiv:2408.04653·cs.CL·August 12, 2024

TL;DR

This paper introduces BatchBPE, a Python implementation that enables efficient batching of BPE merges, making high-quality tokenizer training feasible on low-resource hardware and facilitating experimentation with new tokenization strategies.

Contribution

The paper presents BatchBPE, an open-source, pure Python tool that batches BPE merges and reduces the memory footprint of the training text, making tokenizer training feasible on a basic laptop.

Findings

01

BatchBPE effectively merges hundreds of token pairs simultaneously.

02

It reduces the memory footprint of the text used in vocabulary training.

03

The tool supports experimentation with preprocessing and merge strategies.
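
To make the memory-footprint finding concrete, here is a minimal Python sketch of one way to shrink the training text: store each unique text chunk once, together with its count, and optionally drop the rarest chunks. The split pattern, the function name `count_chunks`, and the `min_count` parameter are assumptions made for illustration; they are not taken from BatchBPE itself.

```python
import re
from collections import Counter

# Hypothetical split pattern (an assumption, not BatchBPE's own): runs of
# letters, runs of digits, or single punctuation marks, each optionally
# preceded by one space, plus any remaining whitespace runs.
CHUNK_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]|\s+")


def count_chunks(texts, min_count=1):
    """Map each unique chunk (as a tuple of UTF-8 byte values) to its count.

    Chunks seen fewer than `min_count` times are dropped, which mimics
    ignoring the least common text chunks in a dataset.
    """
    counts = Counter()
    for text in texts:
        counts.update(CHUNK_PATTERN.findall(text))
    return {
        tuple(chunk.encode("utf-8")): n
        for chunk, n in counts.items()
        if n >= min_count
    }
```

With this representation the training loop touches each distinct chunk once per pass and weights it by its count, so memory grows with the number of distinct chunks rather than with the raw size of the corpus.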

Abstract

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique, combined with reducing the memory footprint of the text used in vocabulary training, makes it feasible to train a high-quality tokenizer on a basic laptop. This paper presents BatchBPE, an open-source pure Python implementation of these concepts, with the goal of making experimentation with new tokenization strategies more accessible, especially in compute- and memory-constrained contexts. BatchBPE's usefulness and malleability are demonstrated through the training of several token vocabularies to explore the batch merging process and to experiment with preprocessing a stop word list and ignoring the least common text chunks in a dataset. Resultant encoded lengths of texts are used as a basic evaluation metric.
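
The batching idea in the abstract can be sketched in a few lines of Python. The sketch below is illustrative only and is not BatchBPE's code: it assumes the training text is already a dictionary mapping token tuples to counts, and it uses a deliberately conservative "safe batch" rule of my own choosing (no token may appear in more than one pair of a batch), which guarantees that occurrences of different pairs in the batch never overlap. The paper's actual selection rule may be less strict. All function names here are hypothetical.

```python
from collections import Counter


def pair_counts(chunks):
    """Count adjacent token pairs, weighted by each chunk's frequency."""
    counts = Counter()
    for tokens, n in chunks.items():
        for pair in zip(tokens, tokens[1:]):
            counts[pair] += n
    return counts


def select_batch(counts, max_batch):
    """Greedily take the most frequent pairs whose tokens are not yet used
    by another pair in this batch (conservative assumption, see lead-in)."""
    used, batch = set(), []
    for (a, b), _ in counts.most_common():
        if len(batch) == max_batch:
            break
        if a in used or b in used:
            continue
        batch.append((a, b))
        used.update((a, b))
    return batch


def apply_merges(tokens, new_ids):
    """Replace every occurrence of a batched pair in one left-to-right pass."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in new_ids:
            out.append(new_ids[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train(chunks, num_merges, max_batch=200, first_id=256):
    """Learn up to `num_merges` merges, several per pass over the chunks."""
    merges, next_id = {}, first_id
    while len(merges) < num_merges:
        counts = pair_counts(chunks)
        batch = select_batch(counts, min(max_batch, num_merges - len(merges)))
        if not batch:
            break  # no pairs left to merge
        new_ids = {}
        for pair in batch:
            new_ids[pair] = next_id
            next_id += 1
        merges.update(new_ids)
        merged = Counter()
        for tokens, n in chunks.items():
            merged[apply_merges(tokens, new_ids)] += n
        chunks = dict(merged)
    return merges, chunks
```

The point of batching is that pair counts are recomputed once per batch rather than once per merge, so a vocabulary of tens of thousands of tokens needs far fewer passes over the data. Note that the merge order produced this way can differ from strictly sequential BPE, since pairs created by a merge only become candidates in the next pass.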
