nmT5 -- Is parallel data still relevant for pre-training massively multilingual language models?

Mihir Kale; Aditya Siddhant; Noah Constant; Melvin Johnson; Rami Al-Rfou; Linting Xue

arXiv:2106.02171 · cs.CL · June 7, 2021 · 1 cites

TL;DR

This paper examines the relevance of parallel data in pre-training large multilingual models like mT5, finding it beneficial for smaller models and low-resource scenarios, but less so as model size increases.

Contribution

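The paper shows that adding parallel data to mT5 pre-training, through objectives such as machine translation, improves performance on downstream multilingual and cross-lingual tasks. The gains are largest for smaller models and when labelled data is limited, which calls into question whether parallel data is essential for larger models.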

Findings

01

Parallel data benefits smaller models and settings with limited labelled data.

02

Gains from parallel data diminish as model capacity grows.

03

Even at larger model sizes, pre-training with parallel data still helps when labelled data is limited.

Abstract

Recently, mT5 - a massively multilingual version of T5 - leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Attention Is All You Need · Linear Layer · mT5 · Multi-Head Attention · Dropout · Byte Pair Encoding · Gated Linear Unit · Layer Normalization · Inverse Square Root Schedule