Beyond Text Compression: Evaluating Tokenizers Across Scales
Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili

TL;DR
This paper introduces a compute-efficient framework for evaluating tokenizers across languages and model scales. It finds that tokenizer choice matters little for English tasks but produces consistent performance differences in multilingual settings, and it proposes new intrinsic metrics for tokenizer assessment.
Contribution
It presents a novel intrinsic evaluation framework for tokenizers, leveraging scaling consistency and Zipf-inspired metrics, improving tokenizer selection for multilingual language models.
Findings
Tokenizer impact is negligible in English tasks but significant in multilingual settings.
Smaller models can accurately predict significant differences in tokenizer impact on larger models, at a fraction of the compute cost.
New intrinsic metrics correlate better with downstream performance than compression metrics.
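One way to check whether a small model ranks tokenizers the same way a larger model does is a rank correlation over per-tokenizer scores. The sketch below uses Kendall's tau for this; the function name `kendall_tau` and the choice of statistic are assumptions made for illustration, not the procedure used in the paper.

```python
from itertools import combinations


def kendall_tau(small_scores, large_scores):
    """Kendall's tau-a between two score lists indexed by tokenizer.

    Illustrative helper (hypothetical name): entry i of each list is the
    downstream score obtained with tokenizer i at that model size.
    """
    n = len(small_scores)
    if n != len(large_scores) or n < 2:
        raise ValueError("need two equally long lists with at least two scores")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The pair agrees if both model sizes order tokenizers i and j the same way.
        product = (small_scores[i] - small_scores[j]) * (large_scores[i] - large_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value near 1 means the two model sizes order the tokenizers almost identically, while a value near 0 means the small model's ranking says little about the large model's.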
Abstract
The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately predict significant differences in tokenizer impact on larger models at a fraction of the compute cost. By systematically evaluating both English-centric and multilingual tokenizers, we find that tokenizer choice has negligible effects on tasks in English but results in consistent performance differences in multilingual settings. We propose new intrinsic tokenizer metrics inspired by Zipf's law that correlate more strongly with downstream performance than text compression when modeling unseen languages. By combining several metrics to capture multiple aspects of tokenizer behavior, we develop a reliable framework for intrinsic tokenizer evaluations. Our…
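To make the two kinds of intrinsic measurement concrete, here is a minimal sketch of a text-compression proxy and a Zipf-style statistic computed from a tokenized corpus. Both `bytes_per_token` and `zipf_slope` are hypothetical helpers written for illustration; the abstract does not define the paper's metrics, so these should not be read as its definitions.

```python
import math
from collections import Counter


def bytes_per_token(text, tokens):
    """Compression proxy: UTF-8 bytes of the input per produced token."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(text.encode("utf-8")) / len(tokens)


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption for this sketch: a token distribution that follows
    Zipf's law closely has a slope near -1.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least-squares slope: covariance over variance of x.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Higher bytes per token means the tokenizer compresses the text more. The slope summarizes the shape of the token frequency distribution in a single number, which is the kind of property a Zipf-inspired metric would examine.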
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Intelligent Tutoring Systems and Adaptive Learning
