Unknown Script: Impact of Script on Cross-Lingual Transfer
Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

TL;DR
This paper investigates how tokenizer choice influences cross-lingual transfer when the target language and its script are absent from the pre-trained model. It finds that the tokenizer matters more than a shared script, language similarity, or model size.
Contribution
It shows that, when transferring to a language whose script the model has never seen, tokenizer choice affects performance more than script similarity or language relatedness.
Findings
Tokenizer choice is a key factor in transfer performance.
A shared script is less influential than the tokenizer.
Model size is a weaker factor than the tokenizer.
Abstract
Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often overlooked aspect in this domain: the influence of the source language of a language model on language transfer performance. We consider a case where the target language and its script are not part of the pre-trained model. We conduct a series of experiments on monolingual and multilingual models that are pre-trained on different tokenization methods to determine factors that affect cross-lingual transfer to a new language with a unique script. Our findings reveal the importance of the tokenizer as a stronger factor than the shared script, language similarity, and model size.
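To make concrete why the tokenizer can dominate when the target script is unseen, here is a minimal toy sketch. It is not the paper's setup: `vocab_tokenize`, `byte_tokenize`, `unk_rate`, and the Latin-only vocabulary are hypothetical names invented for illustration, and the sample string is simply a word in Ethiopic script. A fixed-vocabulary tokenizer collapses every unseen character into one unknown token, while a byte-level tokenizer still produces a lossless sequence.

```python
# Toy illustration only; neither tokenizer is taken from the paper.

def vocab_tokenize(text, vocab, unk="[UNK]"):
    """Character-level lookup in a fixed vocabulary; unknown characters map to `unk`."""
    return [ch if ch in vocab else unk for ch in text]


def byte_tokenize(text):
    """Byte-level tokenization: any string becomes UTF-8 byte values in 0..255."""
    return list(text.encode("utf-8"))


def unk_rate(tokens, unk="[UNK]"):
    """Fraction of tokens that are the unknown token."""
    return sum(t == unk for t in tokens) / len(tokens)


# Hypothetical vocabulary covering only lowercase Latin letters and space.
latin_vocab = set("abcdefghijklmnopqrstuvwxyz ")

seen_text = "selam"
# Three Ethiopic-script characters (U+1230, U+120B, U+121D), used as an unseen-script sample.
unseen_text = "\u1230\u120b\u121d"

# Fixed vocabulary: the seen script survives, the unseen script is all [UNK].
seen_tokens = vocab_tokenize(seen_text, latin_vocab)
unseen_tokens = vocab_tokenize(unseen_text, latin_vocab)

# Byte level: the unseen script is still representable and can be decoded back.
unseen_bytes = byte_tokenize(unseen_text)
round_trip = bytes(unseen_bytes).decode("utf-8")

if __name__ == "__main__":
    print("vocab, seen script:  ", seen_tokens, unk_rate(seen_tokens))
    print("vocab, unseen script:", unseen_tokens, unk_rate(unseen_tokens))
    print("bytes, unseen script:", unseen_bytes)
```

With the fixed vocabulary, the model receives no information about the unseen-script input, whereas the byte-level sequence is longer but preserves the text. Real subword tokenizers fall between these two extremes.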
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Second Language Acquisition and Learning
Methods: Balanced Selection
