Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Buu Phan; Brandon Amos; Itai Gat; Marton Havasi; Matthew Muckley; Karen Ullrich

arXiv:2410.09303·cs.CL·April 15, 2025

TL;DR

This paper introduces a method to derive exact byte-level probabilities from tokenized language models, addressing tokenization bias and enabling improved performance in fill-in-the-middle tasks and model ensembles without additional training.

Contribution

The paper presents the Byte-Token Representation Lemma and a zero-shot algorithm that converts tokenized models into byte-level models, mitigating tokenization bias without additional training.

Findings

1. 18% improvement in FIM benchmarks

2. Seamless integration of models with different vocabularies

3. Up to 3.7% performance gain in model ensembles
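
The ensemble finding rests on the observation that models with different token vocabularies still share the same 256-value byte alphabet. The following is a minimal sketch of that idea, not the paper's implementation: it assumes each model already exposes a next-byte distribution, and it combines them by weighted averaging, which is one possible combination rule chosen here for illustration. The names `ensemble_next_byte` and `sparse_byte_dist`, and all probabilities, are hypothetical.

```python
def sparse_byte_dist(char_probs):
    """Build a 256-entry next-byte distribution from a {char: prob} dict (toy helper)."""
    dist = [0.0] * 256
    for ch, p in char_probs.items():
        dist[ord(ch)] = p
    return dist


def ensemble_next_byte(dists, weights=None):
    """Mix several next-byte distributions defined over the same byte alphabet.

    Weighted averaging is an assumption made for this sketch.
    """
    if weights is None:
        weights = [1.0 / len(dists)] * len(dists)
    mixed = [0.0] * 256
    for w, dist in zip(weights, dists):
        for byte_value, p in enumerate(dist):
            mixed[byte_value] += w * p
    total = sum(mixed)
    return [p / total for p in mixed]  # renormalize to guard against rounding drift


# Two hypothetical models with different tokenizers, both reduced to byte level.
model_a = sparse_byte_dist({"a": 0.7, "b": 0.3})
model_b = sparse_byte_dist({"a": 0.5, "b": 0.5})
mixed = ensemble_next_byte([model_a, model_b])
print(round(mixed[ord("a")], 3), round(mixed[ord("b")], 3))
```

Because the mixing happens over bytes, no vocabulary alignment between the two models is needed in this sketch.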

Abstract

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet it remains an important component for long-sequence scaling. This work studies how tokenization impacts model performance by analyzing and comparing the stochastic behavior of tokenized models with their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term "tokenization bias". To fully characterize this phenomenon, we introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution. From this result, we develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or…
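
To make the notion of tokenization bias concrete, here is a small toy sketch. It is an illustration under stated assumptions, not the paper's algorithm: the two-symbol Markov source, its transition probabilities, the vocabulary `{"AA", "A", "B"}`, and the greedy longest-match tokenizer are all chosen for this example. It compares the byte-level probability that `B` follows `A` with the token-level probability that the token after `A` starts with `B`.

```python
from itertools import product

VOCAB = ["AA", "A", "B"]  # toy vocabulary (assumption)
P_FIRST = {"A": 0.5, "B": 0.5}  # toy initial distribution (assumption)
P_NEXT = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.3, "B": 0.7}}  # toy transitions


def string_prob(s):
    """Probability of string s under the toy first-order Markov source."""
    p = P_FIRST[s[0]]
    for prev, cur in zip(s, s[1:]):
        p *= P_NEXT[prev][cur]
    return p


def greedy_tokenize(s):
    """Greedy longest-match tokenization over VOCAB."""
    tokens, i = [], 0
    while i < len(s):
        tok = max((t for t in VOCAB if s.startswith(t, i)), key=len)
        tokens.append(tok)
        i += len(tok)
    return tokens


def byte_level_prob(first, nxt, n=4):
    """P(second byte == nxt | first byte == first), by exact enumeration."""
    num = den = 0.0
    for chars in product("AB", repeat=n):
        s = "".join(chars)
        if s[0] == first:
            p = string_prob(s)
            den += p
            if s[1] == nxt:
                num += p
    return num / den


def token_level_prob(first_tok, nxt_byte, n=4):
    """P(second token starts with nxt_byte | first token == first_tok)."""
    num = den = 0.0
    for chars in product("AB", repeat=n):
        s = "".join(chars)
        toks = greedy_tokenize(s)
        if toks[0] == first_tok and len(toks) > 1:
            p = string_prob(s)
            den += p
            if toks[1][0] == nxt_byte:
                num += p
    return num / den


byte_p = byte_level_prob("A", "B")
token_p = token_level_prob("A", "B")
print(f"byte level:  P(B | A)       = {byte_p:.3f}")
print(f"token level: P(B.. | tok A) = {token_p:.3f}")
```

With these assumed numbers the byte-level probability is 0.4, while the token-level probability is 1: under greedy tokenization a standalone token `A` can never be followed by another `A`, because the pair would have been merged into `AA`. Both views describe the same source, yet conditioning on tokens rather than bytes changes the next-byte prediction.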

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling