Causal Estimation of Tokenisation Bias
Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel

TL;DR
This paper introduces a causal framework to quantify how the inclusion or exclusion of subwords in tokeniser vocabularies impacts language model probabilities, revealing significant biases that influence model outputs.
Contribution
It presents a novel causal estimation method using regression discontinuity design to measure tokenisation bias effects on language models.
Findings
Tokenisation bias significantly affects model probabilities.
Including a subword in the tokeniser's vocabulary can increase the probability a model assigns to the corresponding characters by up to 17 times.
Tokenisation choice is a crucial factor in language model performance.
Abstract
Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., ⟨hello⟩) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., "hello"). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that…
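To make the estimation idea concrete, here is a minimal sketch of a sharp regression discontinuity estimate on synthetic data. It is not the authors' estimator or data: the running variable, the cutoff of 2000, the bandwidth, the simulated log-probabilities and the function names `rdd_estimate` and `simulate` are all assumptions chosen for illustration. The sketch assumes items on one side of an arbitrary cutoff are "treated" (the subword is in the vocabulary) and measures the jump in the outcome at that cutoff.

```python
import numpy as np


def rdd_estimate(running, outcome, cutoff, bandwidth):
    """Sharp RDD with a local linear fit on each side of the cutoff.

    Returns the estimated jump in the outcome at the cutoff
    (treated side minus control side). Items with running < cutoff
    are treated; this convention is an assumption of the sketch.
    """
    x = np.asarray(running, dtype=float) - cutoff  # centre at the cutoff
    y = np.asarray(outcome, dtype=float)
    window = np.abs(x) <= bandwidth                # keep points near the cutoff
    treated = window & (x < 0)
    control = window & (x >= 0)
    # np.polyfit with degree 1 returns [slope, intercept];
    # the intercept is the fitted outcome exactly at the cutoff.
    _, at_cutoff_treated = np.polyfit(x[treated], y[treated], 1)
    _, at_cutoff_control = np.polyfit(x[control], y[control], 1)
    return at_cutoff_treated - at_cutoff_control


def simulate(n=4000, cutoff=2000, true_effect=1.5, seed=0):
    """Synthetic data (hypothetical): a smooth trend plus a jump at the cutoff."""
    rng = np.random.default_rng(seed)
    running = np.arange(n)
    in_vocab = running < cutoff
    log_prob = (-5.0 - 0.001 * running
                + true_effect * in_vocab
                + rng.normal(0.0, 0.3, size=n))
    return running, log_prob


running, log_prob = simulate()
effect = rdd_estimate(running, log_prob, cutoff=2000, bandwidth=500)
print(f"estimated jump in log-probability at the cutoff: {effect:.2f}")
print(f"implied probability ratio: {np.exp(effect):.2f}")
```

Fitting a separate line on each side lets the outcome trend smoothly with the running variable, so only the discontinuity at the cutoff is attributed to the treatment. The bandwidth trades bias against variance: a narrower window relies less on the linearity assumption but uses fewer points.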
