The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

Aamod Thakur; Ajay Nagpal; Atharva Savarkar; Kundeshwar Pundalik; Siddhesh Dosi; Piyush Sawarkar; Viraj Thakur; Rohit Saluja; Maunendra Sankar Desarkar; Ganesh Ramakrishnan

arXiv:2508.06533 · cs.CL · August 12, 2025

TL;DR

This paper systematically analyzes multilingual tokenizer design, proposing a novel data composition algorithm and pretokenization strategies that significantly improve token efficiency, model performance, and inference speed in Indic scripts.

Contribution

It introduces a new data composition algorithm and pretokenization strategies that enhance multilingual tokenizer efficiency and model quality, especially for Indic scripts.

Findings

1. Reduced token-to-word ratio by ~6% with the new algorithm
2. Achieved over 40% improvement in token-to-word ratio against state-of-the-art models
3. Improved model performance and inference speed through optimized tokenization
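The findings above are stated in terms of the token-to-word ratio. The following is a minimal sketch of how such a ratio and its relative reduction could be computed. It assumes words are whitespace-delimited and that the reported percentages are relative reductions; the summary states neither. The `char_chunks` tokenizer and the two-line corpus are hypothetical stand-ins for illustration, not anything from the paper.

```python
from typing import Callable, Iterable, List


def token_to_word_ratio(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> float:
    """Average tokens per whitespace-delimited word (lower is more efficient)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


def char_chunks(text: str, size: int) -> List[str]:
    """Hypothetical toy tokenizer: split each word into fixed-size chunks."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


# Toy corpus, for illustration only.
corpus = ["tokenizers break words", "into smaller pieces"]
baseline = token_to_word_ratio(corpus, lambda t: char_chunks(t, 2))
improved = token_to_word_ratio(corpus, lambda t: char_chunks(t, 4))
print(f"baseline={baseline:.3f} improved={improved:.3f} "
      f"reduction={relative_reduction(baseline, improved):.1%}")
```

In practice `tokenize` would wrap a real tokenizer's encode call, and the word count would need a script-aware definition of "word" for languages where whitespace splitting is inadequate.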

Abstract

While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pretokenization strategies…
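The abstract does not spell out the data composition algorithm. As a rough illustration of the general idea of balancing a multilingual mixture by per-language token-to-word ratio (fertility), the sketch below raises the sampling weight of languages that tokenize poorly and renormalizes. This is not the authors' procedure: the update rule, the `step` exponent, the language codes, and the `toy_fertility` stand-in (which replaces actually training and evaluating a tokenizer) are all assumptions made for illustration.

```python
from typing import Callable, Dict

Weights = Dict[str, float]


def reweight(weights: Weights, fertility: Weights, step: float = 0.5) -> Weights:
    """Scale each weight by (fertility / mean fertility) ** step, then renormalize."""
    mean_f = sum(fertility.values()) / len(fertility)
    raw = {
        lang: w * (fertility[lang] / mean_f) ** step
        for lang, w in weights.items()
    }
    total = sum(raw.values())
    return {lang: value / total for lang, value in raw.items()}


def balance_mixture(
    initial: Weights,
    measure_fertility: Callable[[Weights], Weights],
    rounds: int = 5,
    step: float = 0.5,
) -> Weights:
    """Repeat: measure per-language fertility under the mixture, then reweight."""
    weights = dict(initial)
    for _ in range(rounds):
        fertility = measure_fertility(weights)
        weights = reweight(weights, fertility, step)
    return weights


# Hypothetical difficulty constants; a real run would train a tokenizer on the
# weighted mixture and measure tokens per word on held-out text per language.
BASE = {"hi": 2.0, "ta": 4.0, "en": 1.2}


def toy_fertility(weights: Weights) -> Weights:
    """Toy model: a language's fertility falls as its share of the data grows."""
    return {
        lang: 1.0 + BASE[lang] / (1.0 + 10.0 * weights[lang])
        for lang in weights
    }


def spread(values: Weights) -> float:
    """Gap between the worst and best per-language fertility."""
    return max(values.values()) - min(values.values())


start = {"hi": 1 / 3, "ta": 1 / 3, "en": 1 / 3}
final = balance_mixture(start, toy_fertility)
print("final weights:", {k: round(v, 3) for k, v in final.items()})
print("fertility spread:", round(spread(toy_fertility(start)), 3),
      "->", round(spread(toy_fertility(final)), 3))
```

Each round in a real setting requires retraining the tokenizer, so the number of rounds is the main cost driver of any scheme of this shape.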

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper is backed by a large-scale empirical study across 16 Indian languages and multiple domains (code, math, text). The experimental scope is impressive, covering both vocabulary scaling and pre-tokenization.
2. The iterative reweighting algorithm based on tokenization fertility is elegant and addresses a real gap in multilingual tokenizer design, i.e., unbalanced sampling that harms low-resource, morphologically complex languages.
3. The authors demonstrate deep awareness of Indic lingu

Weaknesses

1. Although perplexity is reported for small models, there are no large-scale experiments on full LLMs (e.g., GPT, LLaMA, Qwen) to confirm that tokenization improvements translate into stronger language modeling or instruction-following performance.
2. The AdaptMix algorithm is empirically effective but lacks theoretical discussion, convergence properties, relation to distributionally robust optimization, or statistical guarantees of balanced fertility.
3. While the focus on Indic scripts is jus

Reviewer 02 · Rating 2 · Confidence 4

Strengths

S1: The authors focus on non-Latin scripts and undertake foundational research on complex, understudied multilingual Indic language models, thereby providing a valuable basis for future work in the field.
S2: The authors provide a detailed analysis of tokenizer design, examining vocabulary size, pre-tokenization rules, and data composition methods, which facilitates a multi-level understanding of the proposed method's effectiveness in low-resource, morphologically complex language scenarios.

Weaknesses

W1: The authors place excessive emphasis on the token-to-word metric (i.e., vocabulary compression rate). Prior work has shown that higher compression is not necessarily better; excessively high compression can degrade generalization, especially when transferring to new corpora. Therefore, the paper's strong emphasis on this single ratio is unjustified.
W2: In Section 4.1 the authors should adopt a more scientific and systematic criterion for selecting vocabulary size to quantify the trade-off

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The authors constructed a dataset covering 16 Indian languages.
2. The paper is well-structured and clearly written; the proposed method is introduced in a concise and easy-to-understand manner.

Weaknesses

1. The proposed AdaptMix method requires multiple iterations, which makes it more computationally expensive than other tokenizer training approaches.
2. The authors focus primarily on reducing the token-to-word ratio. Although this ratio is an important indicator of tokenizer efficiency, a lower ratio does not necessarily guarantee better performance for large language models. It raises the question of whether reducing the ratio might compromise the semantic representation of morph

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification