Multilingual BERT Post-Pretraining Alignment

Lin Pan; Chung-Wei Hang; Haode Qi; Abhishek Shah; Saloni Potdar; Mo Yu

arXiv:2010.12547 · cs.CL · April 13, 2021 · 5 cites

TL;DR

This paper introduces a post-pretraining alignment method for multilingual BERT that enhances zero-shot cross-lingual transfer by aligning embeddings at word and sentence levels using parallel data, contrastive learning, and code-switching.

Contribution

It presents a simple, effective post-pretraining alignment technique that improves multilingual transferability with less data and fewer parameters than existing models.

Findings

01 Improves zero-shot XNLI accuracy by 4.7% over mBERT.

02 Outperforms the larger XLM-R_Base on MLQA.

03 Achieves results comparable to XLM with less data and fewer parameters.

Abstract

We propose a simple method to align multilingual contextual embeddings as a post-pretraining step for improved zero-shot cross-lingual transferability of the pretrained models. Using parallel data, our method aligns embeddings on the word level through the recently proposed Translation Language Modeling objective as well as on the sentence level via contrastive learning and random input shuffling. We also perform sentence-level code-switching with English when finetuning on downstream tasks. On XNLI, our best model (initialized from mBERT) improves over mBERT by 4.7% in the zero-shot setting and achieves results comparable to XLM for translate-train while using less than 18% of the same parallel data and 31% fewer model parameters. On MLQA, our model outperforms XLM-R_Base, which has 57% more parameters than ours.
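
The abstract does not give the exact form of the sentence-level contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss over a batch of parallel sentences. It assumes each sentence is already encoded as a fixed-size vector, treats the translation at the same batch index as the positive, and treats every other target sentence in the batch as a negative. The function names, the cosine similarity, and the `temperature` value are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(src, tgt, temperature=0.1):
    """InfoNCE-style loss over a batch of sentence embeddings.

    src[i] and tgt[i] are assumed to embed a translation pair;
    every tgt[j] with j != i serves as an in-batch negative.
    The temperature is a hypothetical default, not a reported value.
    """
    total = 0.0
    for i, s in enumerate(src):
        logits = [cosine(s, t) / temperature for t in tgt]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true translation.
        total += log_denom - logits[i]
    return total / len(src)


# Toy 2-d "sentence embeddings" for two English sentences.
english = [[1.0, 0.0], [0.0, 1.0]]
# Translations that sit close to their English counterparts.
aligned = [[0.9, 0.1], [0.1, 0.9]]
# The same translations paired with the wrong English sentence.
misaligned = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(english, aligned)
loss_misaligned = contrastive_loss(english, misaligned)
```

In this toy setup the loss is near zero when each translation is closest to its own source sentence and much larger when the pairs are swapped, which is the pressure that pulls parallel sentences together in embedding space during alignment.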

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · mBERT · Contrastive Learning · Attention Dropout · Residual Connection · Attention Is All You Need · Byte Pair Encoding · Adam · Softmax · Layer Normalization