TREX: Tokenizer Regression for Optimal Data Mixture
Inho Won, Hangyeol Yoo, Minkyung Cho, Jungyeul Park, Hoyun Song, KyungTae Lim

TL;DR
TREX introduces a regression-based framework that efficiently predicts optimal data mixtures for multilingual tokenizer training, improving compression efficiency and scalability over heuristic methods.
Contribution
TREX presents a novel regression approach to predict optimal data mixtures for tokenizers, reducing reliance on heuristics and costly searches in multilingual LLM training.
Findings
TREX outperforms heuristic mixtures by up to 12% in compression efficiency.
The framework demonstrates strong scalability and robustness.
TREX enables scalable mixture search before large-scale tokenizer training.
Abstract
Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TREX's predicted…
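The loop described in the abstract (sample random mixtures, score a proxy tokenizer on each, fit a regressor, then search the fitted model) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the language list, the synthetic `proxy_compression` stand-in (which replaces actually training a proxy tokenizer), the quadratic ridge regressor, and the random candidate search are all hypothetical choices made here. The summary does not say which regression model or compression metric the paper uses, so the score below is assumed to be a tokens-per-byte-style quantity where lower is better.

```python
import random

LANGS = ["en", "ko", "ja"]  # hypothetical language set, for illustration only


def sample_mixture(rng, k):
    """Draw a random point on the probability simplex (Dirichlet(1, ..., 1))."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def proxy_compression(mixture, rng):
    """Synthetic stand-in for training a small proxy tokenizer on `mixture`
    and measuring its compression score (assumed: lower is better)."""
    base = [0.30, 0.45, 0.50]  # made-up per-language difficulty
    cost = sum(b / (1.0 + 4.0 * w) for b, w in zip(base, mixture)) / len(mixture)
    return cost + rng.gauss(0.0, 0.001)  # small measurement noise


def features(w):
    """Quadratic features of the mixture weights (an assumed model class)."""
    return list(w) + [x * x for x in w]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ridge(xs, ys, lam=1e-8):
    """Ridge regression via the normal equations (X^T X + lam I) beta = X^T y."""
    d = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(d)]
    return solve(xtx, xty)


def predict(beta, w):
    return sum(b * f for b, f in zip(beta, features(w)))


def search_mixture(n_proxy=200, n_candidates=5000, seed=0):
    rng = random.Random(seed)
    k = len(LANGS)
    # 1) Score proxy tokenizers on random mixtures (synthetic here).
    mixtures = [sample_mixture(rng, k) for _ in range(n_proxy)]
    scores = [proxy_compression(w, rng) for w in mixtures]
    # 2) Learn to predict the compression score from the mixture.
    beta = fit_ridge([features(w) for w in mixtures], scores)
    # 3) Search many candidate mixtures cheaply with the fitted model.
    candidates = [sample_mixture(rng, k) for _ in range(n_candidates)]
    best = min(candidates, key=lambda w: predict(beta, w))
    return dict(zip(LANGS, best)), predict(beta, best)


best_mixture, predicted_score = search_mixture()
```

The point of the design is that step 3 costs only a model evaluation per candidate, so thousands of mixtures can be screened before committing to one large-scale tokenizer training run. In a real setup, `proxy_compression` would be replaced by actual proxy-tokenizer training and evaluation on held-out text.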
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
