PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining
Machel Reid, Mikel Artetxe

TL;DR
PARADISE introduces a novel pretraining method that leverages parallel data in multilingual sequence-to-sequence models, significantly improving translation and inference performance while reducing computational costs.
Contribution
It extends denoising pretraining by incorporating parallel data through dictionary-based word replacement and translation prediction, enhancing multilingual model training.
Findings
Improves BLEU scores by 2.0 on average
Increases cross-lingual inference accuracy by 6.7 points
Achieves competitive results with less computational cost
Abstract
Despite the success of multilingual sequence-to-sequence pretraining, most existing approaches rely on monolingual corpora, and do not make use of the strong cross-lingual signal contained in parallel data. In this paper, we present PARADISE (PARAllel & Denoising Integration in SEquence-to-sequence models), which extends the conventional denoising objective used to train these models by (i) replacing words in the noised sequence according to a multilingual dictionary, and (ii) predicting the reference translation according to a parallel corpus instead of recovering the original sequence. Our experiments on machine translation and cross-lingual natural language inference show an average improvement of 2.0 BLEU points and 6.7 accuracy points from integrating parallel data into pretraining, respectively, obtaining results that are competitive with several popular models at a fraction of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
