Tokenization with Split Trees
Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner

TL;DR
ToaST is a novel subword tokenization method that optimizes compression using a recursive inference procedure, leading to fewer tokens and improved efficiency in language models.
Contribution
Introduces a tokenization approach that pairs a recursive inference procedure with vocabulary optimization, outperforming existing methods in both token reduction and model performance.
Findings
Reduces token counts on English text by more than 11% compared to BPE, WordPiece, and UnigramLM.
Achieves the highest CORE score among 1.5B-parameter models, outperforming baselines.
Improves Rényi efficiency by using common single-byte tokens less frequently.
Abstract
We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960…
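The two vocabulary-independent and vocabulary-dependent stages described in the abstract (greedy split-tree construction, then recursive descent that emits the first in-vocabulary node on each path) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not state the greedy split criterion, so the rule used here (pick the split point whose rarer half has the highest n-gram count) is an assumption, as are the names `Node`, `build_split_tree`, and `tokenize`, the toy counts, and the fallback of emitting single-byte leaves even when they are absent from the vocabulary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """One node of a split tree; leaves are single bytes."""
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_split_tree(pretoken: bytes, counts: Dict[bytes, int]) -> Node:
    """Greedily split a pretoken into a full binary tree.

    ASSUMPTION: the split score below (min of the two halves' n-gram
    counts) is a stand-in; the source does not specify the criterion.
    The tree depends only on the counts, never on a vocabulary.
    """
    node = Node(pretoken)
    if len(pretoken) > 1:
        best = max(
            range(1, len(pretoken)),
            key=lambda i: min(counts.get(pretoken[:i], 0),
                              counts.get(pretoken[i:], 0)),
        )
        node.left = build_split_tree(pretoken[:best], counts)
        node.right = build_split_tree(pretoken[best:], counts)
    return node


def tokenize(node: Node, vocab: Set[bytes]) -> List[bytes]:
    """Descend the tree and emit the first in-vocabulary node on each path.

    ASSUMPTION: a single-byte leaf is always emitted, i.e. every byte is
    treated as implicitly available.
    """
    if node.text in vocab or node.left is None:
        return [node.text]
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Toy n-gram counts, invented for illustration only.
toy_counts = {
    b"t": 100, b"o": 90, b"a": 80, b"s": 70,
    b"to": 50, b"ast": 40, b"st": 30, b"as": 10,
    b"toa": 5, b"oast": 3, b"toas": 2,
}
tree = build_split_tree(b"toast", toy_counts)
print(tokenize(tree, {b"to", b"st"}))  # [b'to', b'a', b'st']
print(tokenize(tree, {b"toast"}))      # [b'toast']
```

Because the tree is fixed before any vocabulary is chosen, the token count of a pretoken becomes a function of which tree nodes are in the vocabulary. That is what makes it possible to pose vocabulary selection as an optimization over all split trees, as the abstract describes.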
