Causal Estimation of Tokenisation Bias

Pietro Lesci; Clara Meister; Thomas Hofmann; Andreas Vlachos; Tiago Pimentel

arXiv:2506.03149·cs.CL·June 4, 2025

TL;DR

This paper introduces a causal framework to quantify how the inclusion or exclusion of subwords in tokeniser vocabularies impacts language model probabilities, revealing significant biases that influence model outputs.

Contribution

It frames tokenisation bias as a causal effect and presents a method for estimating it with a regression discontinuity design.

Findings

01

Tokenisation bias significantly affects model probabilities.

02

The presence of a subword in the vocabulary can increase the probability of its characters by up to 17 times.

03

Tokenisation choice is a crucial factor in language model performance.

Abstract

Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser (which maps character-strings to subwords) should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including or excluding a subword (e.g., ⟨hello⟩) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., "hello"). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that…
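For readers unfamiliar with the regression discontinuity design named in the abstract, the following is a minimal, generic sketch of a sharp design on synthetic data. It is not the authors' estimator: the running variable, the cutoff, the bandwidth, the simulated jump, and the function names `fit_line` and `sharp_rdd_estimate` are all hypothetical and chosen only for illustration. The idea is to fit a separate line on each side of the cutoff and read the effect off as the gap between the two fitted values at the cutoff.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sharp_rdd_estimate(running, outcome, cutoff, bandwidth):
    """Local linear sharp RDD: gap between the two fitted lines at the cutoff.

    Units with running >= cutoff are treated as 'treated' (an assumption
    of this sketch, not something taken from the paper).
    """
    left_x, left_y, right_x, right_y = [], [], [], []
    for r, y in zip(running, outcome):
        x = r - cutoff  # centre so the cutoff sits at 0
        if abs(x) > bandwidth:
            continue  # only use observations close to the cutoff
        if x < 0:
            left_x.append(x)
            left_y.append(y)
        else:
            right_x.append(x)
            right_y.append(y)
    intercept_left, _ = fit_line(left_x, left_y)
    intercept_right, _ = fit_line(right_x, right_y)
    return intercept_right - intercept_left


# Synthetic data (hypothetical): a smooth linear trend plus a jump at the cutoff.
rng = random.Random(0)
true_jump = 2.0
running = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
outcome = [
    0.5 * r + (true_jump if r >= 0.0 else 0.0) + rng.gauss(0.0, 0.1)
    for r in running
]

estimate = sharp_rdd_estimate(running, outcome, cutoff=0.0, bandwidth=0.5)
print(f"estimated jump at the cutoff: {estimate:.2f}")
```

The estimate is only credible when units just below and just above the cutoff are otherwise comparable; the bandwidth trades bias (wide window) against variance (narrow window).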

Taxonomy

Topics: Statistical Methods in Clinical Trials