Length-MAX Tokenizer for Language Models

Dong Dong; Weijie Su

arXiv:2511.20849·cs.CL·November 27, 2025

TL;DR

The Length-MAX tokenizer represents text with fewer tokens than BPE by favoring longer tokens, which leads to fewer training steps, lower inference latency, and better downstream task performance.

Contribution

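The paper introduces a tokenizer that minimizes the average tokens per character. Vocabulary construction is cast as a graph partitioning problem and solved with a greedy approximation algorithm, and the resulting tokenizer outperforms BPE in efficiency and downstream results.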

Findings

01

14-18% fewer tokens than BPE across datasets

02

Up to 18.5% fewer training steps to reach a fixed validation loss

03

13-14% lower inference latency and a 16% throughput gain

Abstract

We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14-18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at…
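
The quantity the abstract optimizes, average tokens per character, can be measured for any tokenizer. The Python sketch below is illustrative only and is not the authors' code: the function names are hypothetical, and the two fixed-width chunkers are toy stand-ins for a baseline and a longer-token tokenizer, not BPE or Length-MAX.

```python
from typing import Callable, Iterable, List


def tokens_per_character(docs: Iterable[str],
                         tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens emitted per input character over a corpus."""
    total_tokens = 0
    total_chars = 0
    for doc in docs:
        total_tokens += len(tokenize(doc))
        total_chars += len(doc)
    if total_chars == 0:
        raise ValueError("corpus contains no characters")
    return total_tokens / total_chars


def relative_token_reduction(baseline_tpc: float, candidate_tpc: float) -> float:
    """Fraction of tokens saved by the candidate relative to the baseline."""
    if baseline_tpc <= 0:
        raise ValueError("baseline tokens-per-character must be positive")
    return 1.0 - candidate_tpc / baseline_tpc


def fixed_width_tokenizer(width: int) -> Callable[[str], List[str]]:
    """Toy tokenizer (assumption for illustration): cut text into fixed-width chunks."""
    def tokenize(text: str) -> List[str]:
        return [text[i:i + width] for i in range(0, len(text), width)]
    return tokenize


# Two 12-character documents, 24 characters in total.
docs = ["abcdefghijkl", "mnopqrstuvwx"]

baseline_tpc = tokens_per_character(docs, fixed_width_tokenizer(3))   # 8 tokens / 24 chars
candidate_tpc = tokens_per_character(docs, fixed_width_tokenizer(4))  # 6 tokens / 24 chars
reduction = relative_token_reduction(baseline_tpc, candidate_tpc)

print(f"baseline:  {baseline_tpc:.4f} tokens/char")
print(f"candidate: {candidate_tpc:.4f} tokens/char")
print(f"reduction: {reduction:.1%}")
```

With real tokenizers, the same two functions would be applied to a held-out corpus; a figure such as "14-18% fewer tokens" corresponds to `relative_token_reduction` returning a value between 0.14 and 0.18.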


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks