Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation and Understanding

Changtong Zan; Liang Ding; Li Shen; Yu Cao; Weifeng Liu; Dacheng Tao

arXiv:2204.07834·cs.CL·September 22, 2022·1 cites

TL;DR

This paper introduces a code-switching restore task during pretraining to better align multilingual models with downstream tasks, improving cross-lingual performance in text generation and understanding.

Contribution

It proposes a novel pretraining approach with code-switching restore to bridge domain and task gaps in multilingual Seq2Seq models.

Findings

1. Outperforms mBART baseline on translation and summarization tasks.
2. Reduces Euclidean distance of cross-lingual sentence representations.
3. Enhances model generalization with minimal computational overhead.
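
The Euclidean-distance finding can be made concrete with a small sketch. This is an illustration, not the authors' evaluation code: the summary does not say how sentence representations are obtained, so mean pooling over token vectors, the function names `mean_pooled` and `mean_pair_distance`, and the toy vectors are all assumptions.

```python
import math

def mean_pooled(token_vectors):
    """Average token vectors into one sentence vector (assumed pooling choice)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]

def mean_pair_distance(src_sents, tgt_sents):
    """Mean Euclidean distance between paired source/target sentence vectors."""
    dists = [
        math.dist(mean_pooled(src), mean_pooled(tgt))
        for src, tgt in zip(src_sents, tgt_sents)
    ]
    return sum(dists) / len(dists)

# Toy encoder outputs: each sentence is a list of 2-d token vectors (hypothetical data).
src_sents = [[[0.0, 0.0], [2.0, 0.0]], [[0.0, 0.0]]]
tgt_sents = [[[1.0, 3.0], [1.0, 5.0]], [[3.0, 4.0]]]

distance = mean_pair_distance(src_sents, tgt_sents)
print(distance)
```

A lower value for the same set of parallel sentences would indicate that the two languages sit closer together in the representation space, which is the direction of change the finding reports.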

Abstract

Multilingual sequence-to-sequence pretrained language models (multilingual Seq2Seq PLMs), e.g. mBART, are pretrained with a self-supervised task on monolingual data from a wide range of languages, e.g. 25 languages from CommonCrawl, whereas downstream cross-lingual tasks generally operate on a bilingual subset, e.g. English-German. This creates two discrepancies between the pretraining and finetuning stages: a data discrepancy (domain discrepancy) and a cross-lingual learning objective discrepancy (task discrepancy). To bridge these cross-lingual domain and task gaps, we extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task. Specifically, the first stage employs the self-supervised code-switching restore task as a pretext task, allowing the multilingual Seq2Seq PLMs to acquire some in-domain alignment information. And for the second stage,…
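
A minimal sketch of how a code-switching restore training pair might be built: some source words are swapped for their translations, and the model is asked to restore the original sentence. The abstract does not specify the lexicon source, the switching ratio, or the tokenization, so the toy dictionary, the `ratio` parameter, whitespace tokenization, and the function names `code_switch` and `make_restore_pair` are all assumptions made for illustration.

```python
import random

def code_switch(tokens, lexicon, ratio=0.3, seed=0):
    """Replace a fraction of the translatable tokens with their lexicon entries."""
    rng = random.Random(seed)
    # Only tokens with a dictionary entry can be switched.
    candidates = [i for i, tok in enumerate(tokens) if tok in lexicon]
    n_switch = round(len(candidates) * ratio)
    chosen = set(rng.sample(candidates, n_switch))
    return [lexicon[tok] if i in chosen else tok for i, tok in enumerate(tokens)]

def make_restore_pair(sentence, lexicon, ratio=0.3, seed=0):
    """Return (code-switched input, original sentence) for the restore task."""
    tokens = sentence.split()
    switched = code_switch(tokens, lexicon, ratio, seed)
    return " ".join(switched), sentence

# Hypothetical English-German toy lexicon; a real setup would need a much larger one.
toy_lexicon = {"cat": "Katze", "house": "Haus", "small": "klein", "sleeps": "schläft"}

source, target = make_restore_pair(
    "the small cat sleeps in the house", toy_lexicon, ratio=0.5
)
print(source)  # code-switched model input
print(target)  # original sentence the model should restore
```

Training a Seq2Seq model to map `source` back to `target` on in-domain sentences is one way to read the "pretext task" described above; the fixed seed only keeps the toy example reproducible.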

Code & Models

Repositories

zanchangtong/csr4mbart
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · mBART · Sequence to Sequence