Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs

Sachin Pawar; Manoj Apte; Kshitij Jadhav; Girish Keshav Palshikar; Nitin Ramrakhiyani

arXiv:2512.21933·cs.CL·December 29, 2025

Open Access

TL;DR

This paper investigates how LLM tokenization, especially the splitting of natural words into multiple tokens, can degrade performance across various NLP tasks, and proposes penalty functions to quantify this impact.

Contribution

It introduces tokenization penalty functions to measure the negative effects of word breaking on LLM performance, supported by statistical analysis across multiple models and tasks.

Findings

01. Broken words correlate with decreased model performance

02. Tokenization penalties significantly predict NLP task outcomes

03. Natural word breaking impacts LLM efficiency and accuracy

Abstract

Tokenization is the first step in training any Large Language Model (LLM): the text is split into a sequence of tokens according to the model's fixed vocabulary. This differs from traditional tokenization in NLP, where the text is split into a sequence of "natural" words. In LLMs, a natural word may be broken into multiple tokens because of the model's limited vocabulary size (e.g., Mistral's tokenizer splits "martial" into "mart" and "ial"). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text and a specific LLM, indicating how "bad" the tokenization is. We establish the statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.
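To make the idea of a tokenization penalty concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract does not define the penalty functions, so `broken_word_penalty` is a hypothetical stand-in that simply reports the fraction of natural words split into more than one token. The tokenizer is also an assumption: a toy greedy longest-match splitter over a made-up vocabulary (`TOY_VOCAB`), not Mistral's actual tokenizer, chosen only so that "martial" splits into "mart" and "ial" as in the abstract's example.

```python
import re

# Hypothetical toy vocabulary (an assumption for illustration only);
# a real LLM vocabulary has tens of thousands of entries.
TOY_VOCAB = {"the", "art", "s", "mart", "ial", "are", "old"}


def split_word(word, vocab):
    """Split one word into tokens by greedy longest-prefix match.

    Falls back to a single character when no vocabulary entry matches,
    so the function always terminates.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining candidate first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens


def broken_word_penalty(text, vocab):
    """Hypothetical penalty: fraction of natural words split into 2+ tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    broken = sum(1 for w in words if len(split_word(w, vocab)) > 1)
    return broken / len(words)


if __name__ == "__main__":
    print(split_word("martial", TOY_VOCAB))
    print(broken_word_penalty("The martial arts are old", TOY_VOCAB))
```

In this sketch, a penalty of 0 means every natural word maps to a single token, and values closer to 1 mean more words are broken. Swapping `split_word` for a real model's tokenizer would give a per-model penalty for the same text, which is the kind of quantity the abstract describes comparing against task performance.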

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification