Beyond Literal Token Overlap: Token Alignability for Multilinguality
Katharina Hämmerl, Tomasz Limisiewicz, Jindřich Libovický, Alexander Fraser

TL;DR
This paper introduces subword token alignability as a new metric to better predict multilinguality and cross-lingual transfer, especially for language pairs with different scripts, improving upon traditional token overlap measures.
Contribution
The paper proposes subword token alignability as a novel metric that enhances understanding of multilinguality beyond literal token overlap, particularly for languages with different scripts.
Findings
Subword token alignability predicts multilinguality better than token overlap.
The metric is effective for both encoder and decoder models.
It can guide the selection of language pairs and improve multilingual tokenisation.
Abstract
Previous work has considered token overlap, or even similarity of token distributions, as predictors for multilinguality and cross-lingual knowledge transfer in language models. However, these very literal metrics assign large distances to language pairs with different scripts, which can nevertheless show good cross-linguality. This limits the explanatory strength of token overlap for knowledge transfer between language pairs that use distinct scripts or follow different orthographic conventions. In this paper, we propose subword token alignability as a new way to understand the impact and quality of multilingual tokenisation. In particular, this metric predicts multilinguality much better when scripts are disparate and the overlap of literal tokens is low. We analyse this metric in the context of both encoder and decoder models, look at data size as a potential distractor, and discuss…
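To make the "literal" baseline concrete, the following is a minimal sketch of token overlap computed as the Jaccard index over the distinct subword tokens of two corpora. This is one common way to define overlap and is an assumption here, not necessarily the exact formulation used in the paper. The function name `token_overlap` and the toy token lists are hypothetical and hand-made for illustration; they are not the output of any real tokeniser.

```python
def token_overlap(tokens_a, tokens_b):
    """Jaccard overlap between the sets of distinct tokens in two corpora.

    Illustrative definition only (an assumption, not the paper's metric).
    """
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Two empty corpora share nothing; define the overlap as 0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical, hand-written subword sequences (not real tokeniser output).
english = ["▁the", "▁cat", "▁sat", "s", "2024"]
german = ["▁die", "▁katze", "▁sa", "s", "2024"]
russian = ["▁кош", "ка", "▁сид", "ит", "2024"]

same_script = token_overlap(english, german)    # shares "s" and "2024"
cross_script = token_overlap(english, russian)  # shares only "2024"

print(f"English-German overlap:  {same_script:.3f}")
print(f"English-Russian overlap: {cross_script:.3f}")
```

In this toy setup the cross-script pair shares only a digit token, so a literal measure reports it as nearly disjoint regardless of how well the two languages might transfer in a model. That is the limitation the abstract describes.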
Taxonomy
Topics: Language, Linguistics, Cultural Analysis
