The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models

Aradhya Dixit; Shreem Dixit

arXiv:2602.11174 · cs.CL · February 13, 2026

TL;DR

This paper quantifies how the choice of script affects tokenization efficiency and inference latency in multilingual models, and finds disparities large enough to motivate script-aware tokenization strategies.

Contribution

It introduces a systematic measurement of script-induced costs in multilingual models, identifies tokenization as a source of inequity, and argues for script-aware methods.

Findings

01

The higher-fragmentation orthography is 16.5x slower at inference (0.23 vs. 3.8 sentences/second) on identical hardware.

02

Higher fragmentation raises information cost, measured in bits per character, by 19.7% for mBERT and 47.1% for XLM-R.

03

A round-trip conversion check (CER_rt=0.31) suggests the gaps reflect orthography-conditioned processing rather than mapping noise.

Abstract

Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4x increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5x inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the "NLL paradox" from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06->9.65) and +47.1% for XLM-R (12.19->17.94). A round-trip conversion check (CER_rt=0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in…
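The three quantities in the abstract (fertility, throughput ratio, and bits per character) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: it assumes fertility means subword tokens per whitespace-delimited word, and that BPC is a total negative log-likelihood in nats converted to bits and divided by the character count. The helper names and `toy_tokenize` are hypothetical; in practice the tokenizer would be the model's own, and how the NLL is obtained for masked models such as mBERT and XLM-R is not specified in this excerpt.

```python
import math
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str],
              tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_words += len(sentence.split())
        n_tokens += len(tokenize(sentence))
    if n_words == 0:
        raise ValueError("need at least one word to compute fertility")
    return n_tokens / n_words


def bits_per_character(total_nll_nats: float, n_chars: int) -> float:
    """Convert a total NLL (in nats) to bits and normalize by characters.

    Normalizing by characters rather than tokens keeps the measure
    comparable when two orthographies split into different token counts.
    """
    if n_chars <= 0:
        raise ValueError("n_chars must be positive")
    return total_nll_nats / (math.log(2) * n_chars)


def relative_increase(baseline: float, value: float) -> float:
    """Fractional change from baseline to value (0.2 means +20%)."""
    return (value - baseline) / baseline


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into 2-char pieces."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


if __name__ == "__main__":
    # Toy fertility example with the stand-in tokenizer.
    print(fertility(["abcd ef"], toy_tokenize))

    # Arithmetic on the rounded figures quoted in the abstract.
    print(3.8 / 0.23)                       # throughput ratio
    print(relative_increase(8.06, 9.65))    # mBERT BPC change
    print(relative_increase(12.19, 17.94))  # XLM-R BPC change
```

Because the abstract reports rounded values, recomputing the percentages from them may differ from the published figures in the last digit.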

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification