Tokenization is Sensitive to Language Variation
Anna Wegmann, Dong Nguyen, David Jurgens

TL;DR
This paper investigates how language variation affects tokenization and downstream language model performance, highlighting the importance of tokenizer design choices and introducing a new impact estimation method.
Contribution
The paper systematically analyzes how tokenizer parameters influence different task types and proposes a novel approach to estimating tokenizer impact on LLM performance.
Findings
Pre-tokenizer choice has the largest impact on performance.
Different tokenizers are optimal for robustness versus sensitivity tasks.
The new method for estimating tokenizer impact on LLM performance outperforms existing metrics.
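To make the pre-tokenizer finding concrete, here is a minimal sketch of two pre-tokenizers that cut the same informal text differently before any subword algorithm runs. Both functions and the regular expression are illustrative assumptions, not the pre-tokenizers evaluated in the paper.

```python
import re

def whitespace_pretokenize(text):
    # Hypothetical pre-tokenizer A: split on whitespace only,
    # so punctuation stays attached to the neighbouring word.
    return text.split()

def punct_pretokenize(text):
    # Hypothetical pre-tokenizer B: runs of word characters become one piece,
    # and every other non-space character becomes its own piece.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "don't u kno?!"
whitespace_pieces = whitespace_pretokenize(sample)
punct_pieces = punct_pretokenize(sample)

print(whitespace_pieces)  # ["don't", 'u', 'kno?!']
print(punct_pieces)       # ['don', "'", 't', 'u', 'kno', '?', '!']
```

A subword algorithm such as BPE only merges symbols inside each pre-token, so the choice made at this stage bounds which tokens can ever be learned.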
Abstract
Variation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: Tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models with the popular Byte-Pair Encoding algorithm to investigate how key tokenization design choices impact the performance of downstream models: the corpus used to train the tokenizer, the pre-tokenizer and…
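The claim that tokenizers may behave differently for less common forms can be illustrated with a toy Byte-Pair Encoding trainer. This is a sketch under stated assumptions: the word counts are invented for illustration, the tie-breaking rule is arbitrary, and none of it reproduces the paper's actual setup.

```python
from collections import Counter

def apply_merge(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(word_counts, num_merges):
    # Learn an ordered list of merges from a {word: count} dictionary.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = {tuple(apply_merge(list(s), best)): c for s, c in corpus.items()}
    return merges

def encode(word, merges):
    # Start from characters and apply the learned merges in training order.
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Invented counts: the American spelling dominates the fitting corpus.
toy_counts = {"color": 50, "colors": 20, "colour": 1}
merges = train_bpe(toy_counts, num_merges=4)

print(encode("color", merges))   # ['color']
print(encode("colour", merges))  # ['colo', 'u', 'r']
```

In this toy setting the frequent spelling ends up as a single token while the rare variant is split into three pieces, which is the kind of asymmetry that could matter differently for robustness and sensitivity tasks.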
Taxonomy
Methods: Attention Is All You Need · Adam · Softmax · Dropout · Weight Decay · Linear Layer · Layer Normalization · WordPiece
