Where is the signal in tokenization space?
Renato Lui Geh, Honghua Zhang, Kareem Ahmed, Benjie Wang, Guy Van den Broeck

TL;DR
This paper investigates the non-uniqueness of tokenization in Large Language Models, proves that identifying the most probable tokenization of a string is computationally hard, and shows empirically that aggregating probabilities over multiple tokenizations improves model performance.
Contribution
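It studies non-canonical tokenizations, proves that finding the most likely tokenization of a string and computing the marginal probability over all tokenizations are computationally hard, and demonstrates that combining probabilities across tokenizations can improve LLM evaluation results.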
Findings
Marginal probability over tokenizations is often indistinguishable from canonical probability.
Aggregating non-canonical tokenization probabilities improves LLM benchmark performance.
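Finding the most likely tokenization of a string, and computing the marginal probability over all of its tokenizations, are both computationally hard for an autoregressive LLM.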
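The three quantities compared in these findings can be written down as follows. This is only a notational sketch: the symbols x (a string), v (a token sequence), T(x) (the set of token sequences that decode to x), v* (the canonical tokenization), and p_LLM are assumptions chosen for illustration and are not taken from the paper.

```latex
% Canonical probability: score only the tokenizer's own output v^*(x).
p_{\text{canonical}}(x) = p_{\text{LLM}}\bigl(v^{*}(x)\bigr)

% Marginal probability: sum over every token sequence that decodes to x.
p_{\text{marginal}}(x) = \sum_{v \in \mathcal{T}(x)} p_{\text{LLM}}(v)

% Most likely tokenization: the highest-scoring member of \mathcal{T}(x).
v_{\text{best}}(x) = \operatorname*{arg\,max}_{v \in \mathcal{T}(x)} p_{\text{LLM}}(v)

% Autoregressive factorization of a token sequence v = (v_1, \dots, v_n).
p_{\text{LLM}}(v) = \prod_{i=1}^{n} p_{\text{LLM}}(v_i \mid v_1, \dots, v_{i-1})
```

Since the canonical sequence is one of the terms in the sum, the marginal is never smaller than the canonical probability.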
Abstract
Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences, to which the LLMs assign probability values. One common assumption is that the probability of a piece of text is the probability of its canonical token sequence. However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes Tokens as [Tok,ens], but [Tok,en,s] also represents the same text. In this paper, we study non-canonical tokenizations. We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations. We then show how the marginal is, in most cases, indistinguishable from the canonical probability. Surprisingly, we then empirically demonstrate the existence of a significant amount…
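To make the distinction between canonical probability, marginal probability, and most likely tokenization concrete, here is a minimal brute-force sketch in Python. Everything in it is an assumption for illustration: `VOCAB` is a made-up six-token vocabulary (not the Llama2 vocabulary), and `toy_sequence_prob` is a hypothetical stand-in that multiplies fixed per-token scores rather than querying an autoregressive LLM. Only the canonical split [Tok, ens] comes from the abstract.

```python
from math import prod
from typing import Iterator, Tuple

# Hypothetical toy vocabulary and per-token scores (illustrative values only).
TOKEN_PROB = {"Tok": 0.5, "ens": 0.4, "en": 0.2, "s": 0.5, "T": 0.1, "ok": 0.1}
VOCAB = set(TOKEN_PROB)

# Canonical tokenization of "Tokens" as reported in the abstract.
CANONICAL = ("Tok", "ens")


def tokenizations(text: str, vocab: set) -> Iterator[Tuple[str, ...]]:
    """Yield every way to split `text` into a sequence of vocabulary tokens."""
    if not text:
        yield ()
        return
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            for rest in tokenizations(text[end:], vocab):
                yield (prefix,) + rest


def toy_sequence_prob(tokens: Tuple[str, ...]) -> float:
    """Stand-in for an LLM score: product of fixed per-token values."""
    return prod(TOKEN_PROB[t] for t in tokens)


all_splits = list(tokenizations("Tokens", VOCAB))
canonical_prob = toy_sequence_prob(CANONICAL)
marginal_prob = sum(toy_sequence_prob(v) for v in all_splits)
most_likely = max(all_splits, key=toy_sequence_prob)

for v in all_splits:
    print(list(v), toy_sequence_prob(v))
print("canonical:", canonical_prob)
print("marginal: ", marginal_prob)
print("argmax:   ", list(most_likely))
```

The enumeration visits every valid split, and the number of splits can grow exponentially with string length, so this brute-force approach is only practical for toy inputs. That is in keeping with the hardness results stated in the abstract, although the sketch itself proves nothing about them.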
