nmT5 -- Is parallel data still relevant for pre-training massively multilingual language models?
Mihir Kale, Aditya Siddhant, Noah Constant, Melvin Johnson, Rami Al-Rfou, Linting Xue

TL;DR
This paper examines whether parallel data still helps when pre-training large multilingual models such as mT5. It finds parallel data beneficial for smaller models and when labelled data is limited, but less so as model size increases.
Contribution
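It shows that adding parallel data to mT5 pre-training, for example through a machine translation objective, improves performance on downstream multilingual and cross-lingual tasks. The gains are clearest for smaller models and shrink as model capacity grows, which suggests parallel data may be less essential for larger models.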
Findings
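Adding parallel data to pre-training improves downstream multilingual and cross-lingual performance, most clearly for smaller models.
The gains from parallel data diminish as model capacity grows.
Even at larger model sizes, pre-training with parallel data still helps when labelled data is limited.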
Abstract
Recently, mT5 - a massively multilingual version of T5 - leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime.
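The multi-task setup described in the abstract can be sketched roughly as follows: monolingual text is turned into a span-corruption example, a parallel sentence pair is turned into a translation example, and both share the same text-to-text format so they can be sampled into one pre-training stream. This is only an illustrative sketch, not the authors' implementation. The prompt wording, the `<extra_id_N>` sentinel names, the single masked span, and the `parallel_rate` mixing parameter are all assumptions made for the example.

```python
import random


def translation_example(src_text, tgt_text, src_lang, tgt_lang):
    """Cast a parallel sentence pair as a text-to-text example.

    The "translate X to Y:" prefix is an assumed prompt format.
    """
    return {
        "inputs": f"translate {src_lang} to {tgt_lang}: {src_text}",
        "targets": tgt_text,
    }


def span_corruption_example(tokens, span_start, span_len):
    """Mask one contiguous span with a sentinel (simplified to a single span).

    Sentinel names are assumed; real pipelines mask several random spans.
    """
    masked = tokens[span_start:span_start + span_len]
    inputs = tokens[:span_start] + ["<extra_id_0>"] + tokens[span_start + span_len:]
    targets = ["<extra_id_0>"] + masked + ["<extra_id_1>"]
    return {"inputs": " ".join(inputs), "targets": " ".join(targets)}


def mix_tasks(mono_examples, parallel_examples, parallel_rate, num_samples, seed=0):
    """Sample a stream that draws from the parallel pool with probability parallel_rate."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(num_samples):
        pool = parallel_examples if rng.random() < parallel_rate else mono_examples
        mixed.append(rng.choice(pool))
    return mixed


mono = [span_corruption_example("the cat sat on the mat".split(), 1, 2)]
para = [translation_example("Hello", "Hallo", "English", "German")]
stream = mix_tasks(mono, para, parallel_rate=0.5, num_samples=4)
```

Because both tasks produce the same `inputs`/`targets` shape, the same encoder-decoder model and loss can be used for every example in the stream; only the sampling rate between the two pools changes how much parallel data the model sees.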
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Attention Is All You Need · Linear Layer · mT5 · Multi-Head Attention · Dropout · Byte Pair Encoding · Gated Linear Unit · Layer Normalization · Inverse Square Root Schedule
