The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Piyush Sawarkar, Viraj Thakur, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan

TL;DR
This paper systematically analyzes multilingual tokenizer design, proposing a novel data composition algorithm and pretokenization strategies that significantly improve token efficiency, model performance, and inference speed for Indic scripts.
Contribution
It introduces a new data composition algorithm and pretokenization strategies that enhance multilingual tokenizer efficiency and model quality, especially for Indic scripts.
Findings
Reduced token-to-word ratio by ~6% with the new algorithm
Achieved over 40% improvement in token-to-word ratio against state-of-the-art models
Improved model performance and inference speed through optimized tokenization
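As an illustration of the metric behind these findings, the sketch below computes a token-to-word ratio and the relative reduction between two tokenizers. It assumes words are whitespace-delimited and uses a toy fixed-width chunk tokenizer; the names `token_to_word_ratio`, `relative_reduction`, and `chunk_tokenizer` are hypothetical and not taken from the paper, whose exact word definition and evaluation setup may differ.

```python
from typing import Callable, Iterable, List


def token_to_word_ratio(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> float:
    """Total tokens divided by total whitespace-delimited words (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop in the ratio when moving from baseline to improved."""
    return (baseline - improved) / baseline


def chunk_tokenizer(size: int) -> Callable[[str], List[str]]:
    """Toy stand-in for a real tokenizer: splits each word into fixed-width chunks."""
    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for word in text.split():
            tokens.extend(word[i:i + size] for i in range(0, len(word), size))
        return tokens
    return tokenize


if __name__ == "__main__":
    sample = ["tokenizers split words"]
    fine = token_to_word_ratio(sample, chunk_tokenizer(2))
    coarse = token_to_word_ratio(sample, chunk_tokenizer(4))
    print(f"fine={fine:.3f} coarse={coarse:.3f} "
          f"reduction={relative_reduction(fine, coarse):.1%}")
```

A lower ratio means fewer tokens per word, so more text fits in a fixed context length. A real measurement would swap the toy tokenizer for a trained one and evaluate per language on held-out text.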
Abstract
While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pretokenization strategies…
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper is backed by a large-scale empirical study across 16 Indian languages and multiple domains (code, math, text). The experimental scope is impressive, covering both vocabulary scaling and pre-tokenization.
2. The iterative reweighting algorithm based on tokenization fertility is elegant and addresses a real gap in multilingual tokenizer design, i.e., unbalanced sampling that harms low-resource, morphologically complex languages.
3. The authors demonstrate deep awareness of Indic lingu

1. Although perplexity is reported for small models, there are no large-scale experiments on full LLMs (e.g., GPT, LLaMA, Qwen) to confirm that tokenization improvements translate into stronger language modeling or instruction-following performance.
2. The AdaptMix algorithm is empirically effective but lacks theoretical discussion, convergence properties, relation to distributionally robust optimization, or statistical guarantees of balanced fertility.
3. While the focus on Indic scripts is jus
S1: The authors focus on non‑Latin scripts and undertake foundational research on complex, understudied multilingual Indic language models, thereby providing a valuable basis for future work in the field.
S2: The authors provide a detailed analysis of tokenizer design—examining vocabulary size, pre‑tokenization rules, and data composition methods—which facilitates a multi‑level understanding of the proposed method’s effectiveness in low‑resource, morphologically complex language scenarios.

W1: The authors place excessive emphasis on the token‑to‑word metric (i.e., vocabulary compression rate). Prior work has shown that higher compression is not necessarily better; excessively high compression can degrade generalization, especially when transferring to new corpora. Therefore, the paper's strong emphasis on this single ratio is unjustified.
W2: In Section 4.1 the authors should adopt a more scientific and systematic criterion for selecting vocabulary size to quantify the trade‑off
1. The authors constructed a dataset covering 16 Indian languages.
2. The paper is well-structured and clearly written; the proposed method is introduced in a concise and easy-to-understand manner.

1. The proposed AdaptMix method requires multiple iterations, which makes it more computationally expensive compared to other tokenizer training approaches.
2. The authors focus primarily on reducing the token-to-word ratio. Although this ratio is indeed an important indicator for tokenizer efficiency, a lower ratio does not necessarily guarantee better performance for large language models. It raises the question of whether reducing the ratio might compromise the semantic representation of morph
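The reviews describe the proposed data composition method as an iterative reweighting based on tokenization fertility, with a cost that grows with the number of iterations. The sketch below shows one generic way such a loop could look. It is not the paper's AdaptMix: the multiplicative update rule, the `alpha` exponent, the `measure_fertility` callback, and the `toy_fertility` simulation are all assumptions made for illustration. In practice `measure_fertility` would train a tokenizer on a corpus sampled with the given weights and report each language's token-to-word ratio, which is where the per-iteration cost comes from.

```python
from typing import Callable, Dict, Sequence


def reweight_by_fertility(
    languages: Sequence[str],
    measure_fertility: Callable[[Dict[str, float]], Dict[str, float]],
    rounds: int = 5,
    alpha: float = 1.0,
) -> Dict[str, float]:
    """Shift sampling weight toward languages with above-average fertility.

    Hypothetical update rule: w[lang] *= (fertility[lang] / mean) ** alpha,
    followed by renormalization so the weights sum to 1.
    """
    weights = {lang: 1.0 / len(languages) for lang in languages}
    for _ in range(rounds):
        # One (expensive) evaluation per round: train and measure in a real setup.
        fertility = measure_fertility(weights)
        mean_f = sum(fertility.values()) / len(fertility)
        raw = {
            lang: weights[lang] * (fertility[lang] / mean_f) ** alpha
            for lang in languages
        }
        total = sum(raw.values())
        weights = {lang: value / total for lang, value in raw.items()}
    return weights


def toy_fertility(weights: Dict[str, float]) -> Dict[str, float]:
    """Simulated fertility that falls as a language's data share rises.

    The base values are made up for the demo; they are not measurements.
    """
    base = {"hi": 2.0, "ta": 4.0, "en": 1.5}
    return {
        lang: 1.0 + (base[lang] - 1.0) / (1.0 + 10.0 * weights[lang])
        for lang in weights
    }


if __name__ == "__main__":
    final = reweight_by_fertility(["hi", "ta", "en"], toy_fertility)
    for lang, weight in final.items():
        print(f"{lang}: weight={weight:.3f}")
```

In this toy setup the language with the highest simulated fertility ends up with the largest share of the mixture, and the gap between the best and worst per-language fertility narrows from round to round. Whether a real tokenizer behaves this way, and whether the loop converges, depends on the data and the update rule.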
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
