Should you marginalize over possible tokenizations?
Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, Marc Dymetman

TL;DR
This paper investigates whether scoring a string through a single default tokenization, rather than marginalizing over all of its tokenizations, materially changes language-model probability estimates. The effect is small for most data but becomes notable for data with long, complex words.
Contribution
The authors develop an importance sampling algorithm to estimate the true string probability by marginalizing over all tokenizations, assessing the impact on model likelihoods.
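To make the idea concrete, here is a minimal sketch of marginalizing over tokenizations with importance sampling. It is not the authors' algorithm: the toy unigram model (`VOCAB`, `EOS_PROB`), the greedy longest-match tokenizer, and the uniform left-to-right proposal are all assumptions chosen so that the exact marginal can be enumerated and compared against the estimate.

```python
import random

# Hypothetical toy unigram "language model": token -> probability.
# Token probabilities plus EOS_PROB sum to 1, so sequences form a distribution.
VOCAB = {"a": 0.3, "b": 0.2, "ab": 0.3}
EOS_PROB = 0.2  # probability of ending the sequence


def seq_prob(tokens):
    """Probability of one token sequence under the toy model."""
    p = EOS_PROB
    for tok in tokens:
        p *= VOCAB[tok]
    return p


def tokenizations(s):
    """Yield every way to split s into vocabulary tokens."""
    if not s:
        yield []
        return
    for tok in VOCAB:
        if s.startswith(tok):
            for rest in tokenizations(s[len(tok):]):
                yield [tok] + rest


def exact_marginal(s):
    """Sum the model probability over all tokenizations (feasible only for tiny inputs)."""
    return sum(seq_prob(t) for t in tokenizations(s))


def greedy_tokenize(s):
    """Longest-match-first stand-in for a default tokenizer (assumes s uses only 'a' and 'b')."""
    tokens = []
    while s:
        tok = max((t for t in VOCAB if s.startswith(t)), key=len)
        tokens.append(tok)
        s = s[len(tok):]
    return tokens


def sample_tokenization(s, rng):
    """Sample a segmentation left to right; return it with its proposal probability Q."""
    tokens, q = [], 1.0
    while s:
        candidates = [t for t in VOCAB if s.startswith(t)]
        tok = rng.choice(candidates)  # uniform proposal over matching tokens
        q *= 1.0 / len(candidates)
        tokens.append(tok)
        s = s[len(tok):]
    return tokens, q


def importance_sampling_estimate(s, n_samples, seed=0):
    """Average the importance weights P(t) / Q(t) over sampled tokenizations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        tokens, q = sample_tokenization(s, rng)
        total += seq_prob(tokens) / q
    return total / n_samples
```

For the string `"abab"` this toy model has four tokenizations. The default tokenization `["ab", "ab"]` has probability 0.018, while the exact marginal is 0.02592, so the default score underestimates the string probability. `importance_sampling_estimate("abab", 20000)` approximates the marginal without enumerating every tokenization, which is the part that matters when enumeration is intractable.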
Findings
The log-likelihood gap between the default tokenization and the marginalized estimate is no larger than 0.5% in most cases.
The gap increases for data with long, complex words.
Ignoring marginalization is generally justified for standard datasets.
Abstract
Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for…
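The quantities contrasted in the abstract can be written out as follows. The notation is an assumption introduced here for illustration, not taken from the paper: $\mathcal{T}(s)$ is the set of token sequences that decode to the string $s$, $t^{*}(s)$ is the default tokenization, and $Q(\cdot \mid s)$ is any proposal distribution whose support covers $\mathcal{T}(s)$.

```latex
% Marginal probability of a string: sum over all of its tokenizations
P(s) = \sum_{t \in \mathcal{T}(s)} P_{\mathrm{LM}}(t)

% Default practice: score only one tokenization, which lower-bounds the marginal
P_{\mathrm{LM}}\bigl(t^{*}(s)\bigr) \le P(s)

% Generic importance-sampling estimate with N samples from the proposal
\hat{P}(s) = \frac{1}{N} \sum_{i=1}^{N} \frac{P_{\mathrm{LM}}(t_i)}{Q(t_i \mid s)},
\qquad t_i \sim Q(\cdot \mid s)
```

The estimate is unbiased whenever $Q(t \mid s) > 0$ for every $t \in \mathcal{T}(s)$ with nonzero model probability. Its variance depends on how closely $Q$ tracks the model's posterior over tokenizations.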
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
