How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych

TL;DR
This paper systematically compares multilingual and monolingual pretrained language models across nine diverse languages and five tasks, revealing the significant impact of tokenizer choice on monolingual performance.
Contribution
It provides a controlled empirical analysis showing the importance of monolingual tokenizers and data size in multilingual model performance, with new insights into tokenizer effects.
Findings
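A dedicated monolingual tokenizer matters about as much as pretraining data size for monolingual downstream performance.
Languages that are adequately represented in the multilingual model's vocabulary show only a negligible performance gap to their monolingual counterparts.
Replacing the multilingual tokenizer with a specialized monolingual one improves the multilingual model's downstream results for almost every task and language.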
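
One way to make "representation in the vocabulary" concrete is to count how many subwords a tokenizer needs per word; fewer pieces per word suggests the language is better covered. The sketch below is only an illustration under stated assumptions: it uses a toy greedy longest-match (WordPiece-style) splitter, and the two vocabularies, the word list, and the names `tokenize_word` and `fertility` are hypothetical, not taken from the paper's code or data.

```python
def tokenize_word(word, vocab):
    """Greedy longest-match-first split of a single word (WordPiece-style).

    Non-initial pieces carry a "##" prefix. Returns ["[UNK]"] if some
    position cannot be matched by any vocabulary entry.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def fertility(words, vocab):
    """Average number of subwords produced per word."""
    total = sum(len(tokenize_word(w, vocab)) for w in words)
    return total / len(words)


# Hypothetical vocabularies: the "monolingual" one holds longer pieces for
# these words, the "multilingual" one only short fragments.
MONO_VOCAB = {"talo", "##ssa", "kirja", "##sto"}
MULTI_VOCAB = {"ta", "##lo", "##s", "##sa", "ki", "##r", "##ja", "##st", "##o"}
WORDS = ["talossa", "kirjasto"]

if __name__ == "__main__":
    for name, vocab in [("mono", MONO_VOCAB), ("multi", MULTI_VOCAB)]:
        print(name, [tokenize_word(w, vocab) for w in WORDS], fertility(WORDS, vocab))
```

With these toy inputs the "monolingual" vocabulary splits each word into two pieces, while the "multilingual" one fragments them into four and five. Real tokenizers and corpora would be needed to draw any conclusion about an actual model.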
Abstract
In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in…
Code & Models
- 🤗 cmarkea/bloomz-7b1-mt-sft-chat (model) · 773 dl · ♡ 16
- 🤗 cmarkea/bloomz-3b-sft-chat (model) · 805 dl · ♡ 12
- 🤗 cmarkea/bloomz-560m-sft-chat (model) · 923 dl · ♡ 10
- 🤗 RichardErkhov/cmarkea_-_bloomz-560m-sft-chat-4bits (model) · 1 dl
- 🤗 RichardErkhov/cmarkea_-_bloomz-560m-sft-chat-8bits (model) · 1 dl
- 🤗 RichardErkhov/cmarkea_-_bloomz-7b1-mt-sft-chat-gguf (model) · 24 dl
- 🤗 RichardErkhov/cmarkea_-_bloomz-3b-sft-chat-gguf (model) · 27 dl
- 🤗 RichardErkhov/cmarkea_-_bloomz-3b-sft-chat-4bits (model) · 1 dl
- 🤗 RichardErkhov/cmarkea_-_bloomz-3b-sft-chat-8bits (model)
- 🤗 RichardErkhov/cmarkea_-_bloomz-560m-sft-chat-gguf (model) · 117 dl
