Length-MAX Tokenizer for Language Models
Dong Dong, Weijie Su

TL;DR
The Length-MAX tokenizer reduces token count and improves efficiency in language models by optimizing token length, leading to faster training, lower inference latency, and better downstream task performance.
Contribution
The paper introduces a tokenizer that minimizes the average number of tokens per character by casting vocabulary construction as a graph partitioning problem, and it reports better efficiency and downstream results than BPE.
Findings
14-18% fewer tokens than BPE across datasets
Up to 18.5% fewer training steps to reach a fixed validation loss
13-14% lower inference latency and a 16% throughput gain
Abstract
We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14-18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at…
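The quantity the abstract optimizes, average tokens per character, can be illustrated with a minimal Python sketch. This is not the Length-MAX algorithm itself: the two toy vocabularies, the sample sentence, and the greedy longest-match segmenter below are all assumptions made for illustration, chosen only to show how longer vocabulary entries lower the token count for the same text.

```python
def greedy_longest_match(text, vocab):
    """Segment text by taking the longest vocab entry at each position.

    Falls back to a single character when no entry matches, so every
    input can be tokenized. Illustrative only; not the paper's method.
    """
    max_len = max((len(entry) for entry in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try candidate lengths from longest to shortest.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


def tokens_per_character(tokens, text):
    """Average number of tokens needed per character of text."""
    return len(tokens) / len(text)


def relative_reduction(baseline_count, new_count):
    """Fraction of tokens saved relative to the baseline count."""
    return (baseline_count - new_count) / baseline_count


# Hypothetical toy data: neither vocabulary comes from the paper.
text = "the cat sat on the mat"
short_vocab = {"th", "at", "on"}
long_vocab = {"the ", "at ", "on "}

short_tokens = greedy_longest_match(text, short_vocab)
long_tokens = greedy_longest_match(text, long_vocab)

short_tpc = tokens_per_character(short_tokens, text)
long_tpc = tokens_per_character(long_tokens, text)
saving = relative_reduction(len(short_tokens), len(long_tokens))

print(f"short vocab: {len(short_tokens)} tokens, {short_tpc:.3f} tokens/char")
print(f"long vocab:  {len(long_tokens)} tokens, {long_tpc:.3f} tokens/char")
print(f"relative reduction: {saving:.1%}")
```

The percentage figures reported in the abstract are reductions of this kind, measured against BPE on real corpora; the toy numbers printed here carry no such meaning.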
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
