MUTANT: A Recipe for Multilingual Tokenizer Design

Souvik Rana; Arul Menezes; Ashish Kulkarni; Chandra Khatri; Shubham Agarwal

arXiv:2511.03237 · cs.CL · March 24, 2026

Open Access

TL;DR

MUTANT is a recipe for designing multilingual tokenizers. Its Indian-language instance, MUTANT-Indic, is reported to reduce fertility score by 39.5% and increase inference throughput by 44% relative to LLaMA4.

Contribution

The paper presents MUTANT, a recipe for multilingual tokenizer design that combines careful vocabulary and training data design, language-aware pre-tokenization, and subword- and multiword-aware training. It also introduces MUTANT-Indic, a tokenizer for India-specific multilingual LLMs.

Findings

01. 39.5% reduction in fertility score over LLaMA4

02. 44% increase in inference throughput over LLaMA4

03. State-of-the-art performance on Indian languages and code data
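
The fertility figures above can be read with a small sketch. It assumes the common definition of fertility as the average number of tokens produced per whitespace-separated word; the paper's exact definition is not given on this page, so treat that as an assumption. The two tokenizers and all function names below are hypothetical stand-ins for illustration, not MUTANT-Indic or LLaMA4.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def relative_reduction(baseline: float, candidate: float) -> float:
    """Percentage by which the candidate fertility is lower than the baseline."""
    return 100.0 * (baseline - candidate) / baseline


# Hypothetical toy tokenizers, used only to exercise the functions above.
def char_tokenizer(text: str) -> List[str]:
    # One token per non-space character: a deliberately inefficient baseline.
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> List[str]:
    # One token per word: fertility of exactly 1.0.
    return text.split()


corpus = ["ab cd", "ef gh"]
baseline = fertility(corpus, char_tokenizer)
candidate = fertility(corpus, word_tokenizer)
print(f"baseline={baseline:.2f} candidate={candidate:.2f} "
      f"reduction={relative_reduction(baseline, candidate):.1f}%")
```

Lower fertility means fewer tokens for the same text, which is why a fertility reduction usually goes together with higher inference throughput.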

Abstract

Tokenizers play a crucial role in determining the performance, training efficiency, and inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging due to diverse scripts and rich morphological variation. While subword methods like Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remains underexplored. We present MUTANT, a recipe for building multilingual tokenizers, with careful vocabulary and training data design, language-aware pre-tokenization, and subword- and multiword-aware training. We also introduce MUTANT-Indic, a tokenizer for India-specific multilingual LLMs that produces linguistically coherent tokens and achieves state-of-the-art performance. Evaluated across English, 22 Indian languages, and code data, our tokenizer improves the average fertility score by…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare