Understanding and Mitigating Tokenization Bias in Language Models
Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

TL;DR
This paper identifies a sampling bias introduced by tokenization schemes in language models and proposes algorithms that obtain unbiased estimates without finetuning, making it possible to simulate token-free behavior from a tokenized model.
Contribution
The paper proposes one algorithm for each of two encoding schemes, maximum prefix encoding (MPE) and byte-pair encoding (BPE), that recovers unbiased estimates from any language model trained on tokenized data, without finetuning the model.
Findings
Algorithms accurately recover transition probabilities in Markov-chain setups.
For MPE, the cost of the proposed method, measured as the number of model runs, scales linearly with sequence length.
Bias is corrected without additional training or finetuning.
Abstract
State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair-encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, for each encoding scheme above, we propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data. Our methods do not require finetuning the model, and the complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness…
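To make the bias described in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a toy two-state Markov chain over the characters A and B and a hypothetical vocabulary {AA, A, B}; the function names, the vocabulary, and the `p_stay` parameter are illustrative choices. Under greedy longest-prefix matching, the token "A" can never be followed by a token starting with "A", because the encoder would have emitted "AA" instead. A model trained on such token streams therefore sees only "B" after the token "A", even though the character A is frequently followed by another A.

```python
import random
from collections import Counter

VOCAB = ["AA", "A", "B"]  # hypothetical toy vocabulary, assumed for illustration


def mpe_encode(text, vocab=VOCAB):
    """Greedy longest-prefix-match encoding (maximum prefix encoding)."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for tok in by_length:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"cannot encode character at position {i}")
    return tokens


def sample_chain(n, p_stay=0.6, seed=0):
    """Sample n characters from a two-state Markov chain over {A, B}.

    p_stay is the (assumed) probability of repeating the current character.
    """
    rng = random.Random(seed)
    state, out = "A", []
    for _ in range(n):
        out.append(state)
        if rng.random() >= p_stay:
            state = "B" if state == "A" else "A"
    return "".join(out)


def next_token_counts(tokens, context):
    """Count which tokens immediately follow `context` in the token stream."""
    return Counter(b for a, b in zip(tokens, tokens[1:]) if a == context)


text = sample_chain(10_000)
tokens = mpe_encode(text)

# Character-level statistic: how often A is followed by A in the raw string.
char_pairs = Counter(zip(text, text[1:]))
p_char = char_pairs[("A", "A")] / (char_pairs[("A", "A")] + char_pairs[("A", "B")])

# Token-level statistic: what follows the single-character token "A".
after_A = next_token_counts(tokens, "A")

print("P(next char = A | char A) ~", round(p_char, 3))
print("tokens following token 'A':", dict(after_A))
```

The character-level estimate stays close to `p_stay`, while the token-level counts after "A" contain only "B". More training data does not change this, since the gap comes from the encoder and not from estimation error.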
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
