Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich

TL;DR
This paper introduces a method to derive exact byte-level probabilities from tokenized language models, addressing tokenization bias and enabling improved performance in fill-in-the-middle tasks and model ensembles without additional training.
Contribution
It presents the Byte-Token Representation Lemma and a zero-shot algorithm that converts tokenized models into byte-level models, mitigating tokenization bias without further training.
Findings
18% improvement in FIM benchmarks
Seamless integration of models with different vocabularies
Up to 3.7% performance gain in model ensembles
Abstract
Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing and comparing the stochastic behavior of tokenized models with their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term "tokenization bias". To fully characterize this phenomenon, we introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution. From this result, we develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or…
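The gap between token-level and byte-level predictions described in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm. It assumes a hypothetical three-entry vocabulary, a greedy longest-match tokenizer, and a uniform source over two-character strings, all chosen only for illustration. It shows that conditioning on the token "a" rules out a following "b" (because "ab" would have been merged into one token), while summing over every token sequence whose bytes begin with the prefix recovers the byte-level probability.

```python
from itertools import product

# Hypothetical toy vocabulary, listed longest-first for greedy matching.
VOCAB = ["ab", "a", "b"]

def tokenize(text):
    """Greedy longest-match tokenization over VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        tok = next(t for t in VOCAB if text.startswith(t, i))
        tokens.append(tok)
        i += len(tok)
    return tuple(tokens)

# Toy source: each 2-character string over {a, b} has probability 0.25.
byte_dist = {"".join(p): 0.25 for p in product("ab", repeat=2)}

# The statistically equivalent "tokenized model": same mass, token sequences.
token_dist = {tokenize(s): p for s, p in byte_dist.items()}

def mass(dist, event):
    """Total probability of the outcomes in dist that satisfy event."""
    return sum(p for outcome, p in dist.items() if event(outcome))

# Byte-level view: P(second byte = "b" | first byte = "a").
byte_level = mass(byte_dist, lambda s: s.startswith("ab")) / mass(
    byte_dist, lambda s: s.startswith("a")
)

# Token-level view: P(next token starts with "b" | first token = "a").
token_level = mass(
    token_dist, lambda t: t[0] == "a" and len(t) > 1 and t[1].startswith("b")
) / mass(token_dist, lambda t: t[0] == "a")

# Marginalize over all token sequences whose bytes start with the prefix.
recovered = mass(token_dist, lambda t: "".join(t).startswith("ab")) / mass(
    token_dist, lambda t: "".join(t).startswith("a")
)

print(byte_level, token_level, recovered)  # 0.5 0.0 0.5
```

In this toy setting the token-conditioned estimate is 0.0 even though the true next-byte probability is 0.5, and the marginalized estimate matches the byte-level value.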
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
