Evaluating Morphological Alignment of Tokenizers in 70 Languages

Catherine Arnett; Marisa Hudspeth; Brendan O'Connor

arXiv:2507.06378·cs.CL·July 10, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper expands the MorphScore metric from 22 to 70 languages to evaluate how well tokenizers align token boundaries with morphological boundaries. It then tests whether alignment correlates with language model performance and finds that alignment explains little of the variance.

Contribution

It extends MorphScore to support 70 languages, providing a more flexible evaluation of tokenizer morphological alignment and analyzing its relation to downstream task performance.

Findings

01

Morphological alignment scores do not strongly predict model performance.

02

Expanded MorphScore supports 70 languages, increasing evaluation coverage.

03

Morphological alignment alone is insufficient to measure tokenizer quality.

Abstract

While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain very much variance in model performance,…
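To make the idea of aligning token boundaries with morphological boundaries concrete, here is a minimal Python sketch. It is not the MorphScore implementation: the function names, the input format (a word's tokens plus the character offset of one morpheme boundary), and the decision to count an unsplit word as not aligned are all assumptions made for illustration. It also assumes tokens are plain substrings of the word, with no tokenizer-specific prefix markers.

```python
from typing import Sequence, Set, Tuple


def token_boundaries(tokens: Sequence[str]) -> Set[int]:
    """Character offsets inside the word where one token ends and the next begins."""
    offsets: Set[int] = set()
    position = 0
    for token in tokens[:-1]:  # the end of the last token is the end of the word
        position += len(token)
        offsets.add(position)
    return offsets


def is_aligned(tokens: Sequence[str], morpheme_boundary: int) -> bool:
    """True if some token boundary falls exactly on the morpheme boundary.

    Assumption: a word kept as a single token has no internal boundary,
    so this toy version counts it as not aligned.
    """
    return morpheme_boundary in token_boundaries(tokens)


def alignment_rate(items: Sequence[Tuple[Sequence[str], int]]) -> float:
    """Fraction of (tokens, morpheme_boundary) pairs that are aligned."""
    if not items:
        raise ValueError("alignment_rate needs at least one item")
    hits = sum(is_aligned(tokens, boundary) for tokens, boundary in items)
    return hits / len(items)


# Hypothetical example: "books" = "book" + "s", morpheme boundary at offset 4.
examples = [
    (["book", "s"], 4),  # token boundary at 4: aligned
    (["boo", "ks"], 4),  # token boundary at 3: not aligned
]
print(alignment_rate(examples))  # 0.5
```

A per-language score of this kind could then be compared against task performance, which is the sort of correlation the abstract describes. How unsplit words are scored and how items are weighted are design choices that this sketch does not settle.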

Code & Models

Repositories

catherinearnett/morphscore
Official

Datasets

catherinearnett/morphscore
dataset · 293 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification