How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

Phillip Rust; Jonas Pfeiffer; Ivan Vulić; Sebastian Ruder; Iryna Gurevych

arXiv:2012.15613·cs.CL·June 3, 2021

TL;DR

This paper systematically compares multilingual and monolingual pretrained language models across nine diverse languages and five tasks, revealing the significant impact of tokenizer choice on monolingual performance.

Contribution

It provides a controlled empirical analysis showing that both pretraining data size and a dedicated monolingual tokenizer matter for the monolingual performance of multilingual models, with new insights into tokenizer effects.

Findings

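01. A dedicated monolingual tokenizer matters as much as pretraining data size for monolingual task performance.

02. How well a language is represented in the multilingual vocabulary affects the size of the performance gap to its monolingual counterpart.

03. Replacing the multilingual tokenizer with a specialized monolingual one improves the multilingual model's downstream task results.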

Abstract

In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in…
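One common way to quantify how well a tokenizer serves a language is subword fertility (the average number of subword tokens per word) together with the proportion of words that get split at all. The sketch below illustrates both measures with a toy greedy longest-match splitter in the WordPiece style. The word list and both vocabularies are hypothetical examples made up for illustration, not data or code from the paper, and the splitter is a simplification of what real tokenizers do.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No vocabulary entry covers this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def fertility(words: List[str], vocab: Set[str]) -> float:
    """Average number of subword tokens produced per word."""
    return sum(len(wordpiece_tokenize(w, vocab)) for w in words) / len(words)


def continued_word_ratio(words: List[str], vocab: Set[str]) -> float:
    """Fraction of words that are split into two or more subword tokens."""
    return sum(len(wordpiece_tokenize(w, vocab)) > 1 for w in words) / len(words)


# Hypothetical toy data: three Finnish word forms and two made-up vocabularies.
WORDS = ["talossa", "kirja", "kirjassa"]
# A vocabulary "trained" on the language itself keeps frequent words and suffixes whole.
MONO_VOCAB = {"talossa", "talo", "kirja", "##ssa"}
# A shared multilingual vocabulary has only short, generic pieces left for this language.
MULTI_VOCAB = {"ta", "##lo", "##s", "##sa", "ki", "##r", "##ja", "##a"}

if __name__ == "__main__":
    for name, vocab in [("monolingual", MONO_VOCAB), ("multilingual", MULTI_VOCAB)]:
        print(name, fertility(WORDS, vocab), continued_word_ratio(WORDS, vocab))
```

Lower fertility and a lower share of split words suggest that the vocabulary represents the language more adequately, which is the property the comparison between monolingual and multilingual tokenizers turns on.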

Code & Models

Repositories

Adapter-Hub/hgiyt
PyTorch · Official

Datasets

occiglot/tokenizer-wiki-bench
dataset · 24k downloads
