TREX: Tokenizer Regression for Optimal Data Mixture

Inho Won; Hangyeol Yoo; Minkyung Cho; Jungyeul Park; Hoyun Song; KyungTae Lim

arXiv:2601.13588·cs.CL·January 21, 2026

TL;DR

TREX introduces a regression-based framework that efficiently predicts optimal data mixtures for multilingual tokenizer training, improving compression efficiency and scalability over heuristic methods.

Contribution

TREX presents a novel regression approach to predict optimal data mixtures for tokenizers, reducing reliance on heuristics and costly searches in multilingual LLM training.

Findings

01 TREX outperforms heuristic mixtures by up to 12% in compression efficiency.

02 The framework demonstrates strong scalability and robustness.

03 TREX enables scalable mixture search before large-scale tokenizer training.

Abstract

Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TREX's predicted…
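The loop the abstract describes (sample random mixtures, measure proxy compression, fit a regressor, search the fitted model) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: `proxy_score` is a hypothetical synthetic stand-in for training a small proxy tokenizer and measuring its compression, the three-language setup and its constants are invented, and the quadratic least-squares model is a simple substitute, not the regressor used in the paper.

```python
import random

def sample_mixture(num_langs, rng):
    """Draw language ratios uniformly from the simplex (they sum to 1)."""
    draws = [rng.expovariate(1.0) for _ in range(num_langs)]
    total = sum(draws)
    return [d / total for d in draws]

def proxy_score(mixture):
    """HYPOTHETICAL stand-in for 'train a proxy tokenizer, measure compression'.
    Returns a synthetic cost (lower is better) with an invented ideal mixture."""
    ideal = [0.6, 0.3, 0.1]
    curvature = [1.0, 2.0, 1.5]
    return 0.25 + sum(c * (m - t) ** 2 for c, m, t in zip(curvature, mixture, ideal))

def features(mixture):
    """Quadratic features; no bias term is needed because the ratios sum to 1."""
    return list(mixture) + [m * m for m in mixture]

def fit_least_squares(rows, targets, ridge=1e-9):
    """Solve the ridge-stabilised normal equations with Gauss-Jordan elimination."""
    n = len(rows[0])
    aug = [
        [sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def predict(weights, mixture):
    """Predicted cost of a mixture under the fitted model."""
    return sum(w * f for w, f in zip(weights, features(mixture)))

rng = random.Random(0)

# Steps 1-2: random mixtures and their (synthetic) proxy compression scores.
train = [sample_mixture(3, rng) for _ in range(200)]
scores = [proxy_score(m) for m in train]

# Step 3: learn to predict the score from the mixture.
weights = fit_least_squares([features(m) for m in train], scores)

# Step 4: search many candidate mixtures using only the cheap fitted model.
candidates = [sample_mixture(3, rng) for _ in range(2000)]
best = min(candidates, key=lambda m: predict(weights, m))
print([round(m, 3) for m in best], round(predict(weights, best), 4))
```

The point of the design is that the expensive step (the proxy measurement) runs only on the 200 training mixtures, while the 2000-candidate search touches only the fitted model. In a real setting `proxy_score` would be replaced by actual proxy tokenizer training, and the regressor would need to be validated against held-out mixtures.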

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling