MonoByte: A Pool of Monolingual Byte-level Language Models
Hugo Abonizio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo Nogueira

TL;DR
This paper introduces MonoByte, a set of monolingual byte-level language models pretrained under consistent conditions, enabling more reliable cross-lingual transferability studies without tokenization biases.
Contribution
The authors release 10 monolingual byte-level models, all pretrained under the same configuration with a large compute budget (equivalent to 420 days on a V100) and large corpora, providing a standardized resource for cross-lingual research.
Findings
Monolingual byte-level models perform competitively with multilingual models on QA and NLI tasks.
Eliminating tokenization issues allows for broader cross-lingual experiments.
Models pretrained on non-natural texts serve as sanity checks.
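To make the byte-level idea concrete, here is a minimal sketch of how text in any language can be mapped to a fixed vocabulary of byte IDs with no learned tokenizer. The function names and the offset of 3 reserved IDs are assumptions made for illustration; they are not taken from the paper or from the released models.

```python
from typing import List

# Hypothetical: reserve IDs 0..2 for special tokens (e.g. pad, end-of-sequence, unknown).
SPECIAL_OFFSET = 3


def encode_bytes(text: str) -> List[int]:
    """Map a string to byte-level IDs: one ID per UTF-8 byte, shifted by the offset."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")]


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, skipping any reserved special-token IDs."""
    raw = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The same 256-byte vocabulary covers every script; only sequence length changes.
    for sample in ["hello", "olá", "привет"]:
        ids = encode_bytes(sample)
        print(f"{sample!r}: {len(sample)} characters -> {len(ids)} byte IDs")
```

Because the vocabulary is identical for every language, a model fine-tuned on one language can be evaluated on another without vocabulary mismatch. The trade-off is longer sequences for scripts whose characters take several bytes in UTF-8.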
Abstract
The zero-shot cross-lingual ability of models pretrained on multilingual and even monolingual corpora has spurred many hypotheses to explain this intriguing empirical result. However, due to the costs of pretraining, most research uses public models whose pretraining methodology, such as the choice of tokenization, corpus size, and computational budget, might differ drastically. When researchers pretrain their own models, they often do so under a constrained budget, and the resulting models might underperform significantly compared to SOTA models. These experimental differences led to various inconsistent conclusions about the nature of the cross-lingual ability of these models. To help further research on the topic, we released 10 monolingual byte-level models rigorously pretrained under the same configuration with a large compute budget (equivalent to 420 days on a V100) and corpora…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
