TL;DR
This paper investigates how subword tokenization efficiency, measured via information-theoretic metrics like Rényi entropy, correlates with downstream NLP model performance, especially in machine translation.
Contribution
It introduces an information-theoretic framework for understanding tokenizer effectiveness and demonstrates Rényi entropy as a strong predictor of translation quality.
Findings
Rényi entropy with α=2.5 correlates highly with BLEU scores.
An encoding that is optimal with respect to Shannon entropy assigns very long codes to rare tokens and very short codes to frequent ones.
Efficiency metrics based on Rényi entropy outperform compressed length in predicting performance.
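For reference, the quantities named in these findings can be written out as follows. This is a sketch using the standard textbook definitions and the normalization described in the abstract (entropy divided by the maximum possible entropy). The symbols $p$, $V$, and $\mathrm{Eff}_\alpha$ are notation assumed here and may differ from the paper's own.

```latex
% Shannon entropy of the token distribution p over vocabulary V
H(p) = -\sum_{w \in V} p(w) \log p(w)

% Renyi entropy of order \alpha (\alpha > 0, \alpha \neq 1);
% it recovers the Shannon entropy in the limit \alpha \to 1
H_\alpha(p) = \frac{1}{1 - \alpha} \log \sum_{w \in V} p(w)^{\alpha}

% Efficiency: entropy relative to its maximum, \log |V|,
% which is attained by the uniform distribution over V
\mathrm{Eff}_\alpha(p) = \frac{H_\alpha(p)}{\log |V|}
```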
Abstract
Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the Rényi entropy with α…
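The efficiency ratio described in the abstract can be illustrated with a short script. This is a minimal sketch, not the authors' implementation: the function names (`token_distribution`, `shannon_entropy`, `renyi_entropy`, `efficiency`) are hypothetical, the maximum possible entropy is assumed to be the log of the vocabulary size, and the token lists are toy data.

```python
import math
from collections import Counter


def token_distribution(tokens):
    """Return the empirical unigram probabilities of a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha in nats; alpha = 1 falls back to Shannon."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if math.isclose(alpha, 1.0):
        return shannon_entropy(probs)
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


def efficiency(probs, vocab_size, alpha=1.0):
    """Entropy divided by its maximum, log(vocab_size).

    Assumption: the maximum possible entropy is that of a uniform
    distribution over the vocabulary, as the abstract's wording suggests.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    return renyi_entropy(probs, alpha) / math.log(vocab_size)


# Toy data (hypothetical): one balanced and one skewed token stream,
# both over a 4-token vocabulary.
balanced = ["a", "b", "c", "d"] * 5
skewed = ["a"] * 14 + ["b"] * 2 + ["c"] * 2 + ["d"] * 2

for name, stream in [("balanced", balanced), ("skewed", skewed)]:
    probs = token_distribution(stream)
    print(name,
          "Shannon efficiency:", round(efficiency(probs, 4, alpha=1.0), 3),
          "Renyi (alpha=2.5) efficiency:", round(efficiency(probs, 4, alpha=2.5), 3))
```

A uniform token distribution reaches an efficiency of 1 at every order, while a skewed distribution scores lower, and lower still at α=2.5 than at α=1, because Rényi entropy does not increase as α grows.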
