Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models
Iaroslav Chelombitko, Egor Safronov, Aleksey Komissarov

TL;DR
This paper introduces Qtok, a framework for evaluating multilingual tokenizer quality in large language models. It highlights how tokenizer quality affects model performance across diverse languages and provides practical assessment tools.
Contribution
The paper presents a systematic set of metrics and a tool for evaluating multilingual tokenizer quality, addressing a gap in current LLM development practices.
Findings
Token distribution varies significantly across languages and linguistic categories.
The analysis identifies biases and areas for improvement in current tokenization strategies.
Qtok enables effective comparison and selection of tokenizers for multilingual models.
Abstract
In the development of Large Language Models (LLMs), considerable attention has been given to the quality of training datasets. However, the role of tokenizers in the LLM training pipeline, particularly for multilingual models, has received less focus. The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively. We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts. Our research proposes a set of metrics for evaluating tokenizer quality, including measures of language coverage, token completeness, and distribution across languages and linguistic categories. Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts. Our analysis revealed significant variations…
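As a rough illustration of what a "distribution across languages" measure could look like, the sketch below buckets vocabulary tokens by writing script and reports each bucket's share of the vocabulary. It is not Qtok's implementation: the function names `script_of` and `script_distribution` are hypothetical, and using the first word of a character's Unicode name as a script label is a simplifying assumption. Script is also only a coarse proxy for language.

```python
import unicodedata
from collections import Counter


def script_of(token: str) -> str:
    """Return a coarse script label for a token (heuristic, not Qtok's method).

    Assumption: the first word of a character's Unicode name (e.g. "LATIN",
    "CYRILLIC", "CJK") is a good-enough stand-in for its script.
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and word-boundary markers
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    if not scripts:
        return "NON_ALPHA"
    if len(scripts) > 1:
        return "MIXED"
    return scripts.pop()


def script_distribution(vocab) -> dict:
    """Map each script label to its fraction of the vocabulary."""
    counts = Counter(script_of(tok) for tok in vocab)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in counts.items()}


if __name__ == "__main__":
    # Toy vocabulary; a real one would come from a tokenizer's vocab file.
    toy_vocab = ["the", "при", "123", "aб"]
    for script, share in sorted(script_distribution(toy_vocab).items()):
        print(f"{script}: {share:.2f}")
```

Comparing these shares across several tokenizers gives a quick, if crude, view of which scripts a vocabulary favors. Byte-level vocabularies would need to be decoded to text first for the labels to be meaningful.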
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
