Multilingual Translation via Grafting Pre-trained Language Models

Zewei Sun; Mingxuan Wang; Lei Li

arXiv:2109.05256·cs.CL·September 14, 2021

TL;DR

This paper introduces Graformer, which grafts separately pre-trained (masked) language models, such as a BERT-style encoder for one language and a GPT-style decoder for another, into a single multilingual translation model. Monolingual data is used for pre-training and parallel data for the grafting training.

Contribution

It proposes a novel approach to graft pre-trained language models for different languages, enabling effective multilingual translation without training from scratch.

Findings

01. Achieves an average improvement of 5.8 BLEU in x2en directions over a multilingual Transformer of the same size

02. Achieves an average improvement of 2.9 BLEU in en2x directions over the same baseline

03. Evaluated on 60 translation directions

Abstract

Can pre-trained BERT for one language and GPT for another be glued together to translate texts? Self-supervised training using only monolingual data has led to the success of pre-trained (masked) language models in many NLP tasks. However, directly connecting BERT as an encoder and GPT as a decoder can be challenging in machine translation, for GPT-like models lack a cross-attention component that is needed in seq2seq decoders. In this paper, we propose Graformer to graft separately pre-trained (masked) language models for machine translation. With monolingual data for pre-training and parallel data for grafting training, we take maximal advantage of both types of data. Experiments on 60 directions show that our method achieves average improvements of 5.8 BLEU in x2en and 2.9 BLEU in en2x directions compared with the multilingual Transformer of the same size.
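
The architectural gap the abstract points to can be made concrete with a small sketch. The plain-Python code below is an illustration only, not Graformer's implementation: it contrasts a GPT-like layer, which attends causally to its own prefix, with a seq2seq decoder layer, which additionally needs cross-attention to encoder outputs. All function names (`attend`, `decoder_only_layer`, `seq2seq_decoder_layer`) and the toy vectors are hypothetical, and learned projections, multiple heads, residual connections, and feed-forward sublayers are deliberately omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors.

    With causal=True, query i may only see keys 0..i, as in a
    language-model decoder. No learned projections are applied
    (a simplification for illustration).
    """
    dim = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys[:limit]
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values[:limit]))
            for j in range(len(values[0]))
        ])
    return outputs


def decoder_only_layer(target):
    """GPT-like layer: causal self-attention only, no access to a source."""
    return attend(target, target, target, causal=True)


def seq2seq_decoder_layer(target, encoder_out):
    """Seq2seq decoder layer: causal self-attention, then cross-attention
    in which target positions query the encoder outputs."""
    hidden = attend(target, target, target, causal=True)
    return attend(hidden, encoder_out, encoder_out)


# Toy states standing in for encoder outputs and target-side embeddings.
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_states = [[0.5, 0.5], [1.0, 0.0]]

lm_out = decoder_only_layer(target_states)
mt_out = seq2seq_decoder_layer(target_states, source_states)
```

In this sketch `decoder_only_layer` has no argument through which source-side information could enter, which is the gap the abstract describes; `seq2seq_decoder_layer` closes it with a second attention step whose keys and values come from the encoder.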

Code & Models

Repositories

sunzewei2715/Graformer (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Sigmoid Activation · Tanh Activation · Cosine Annealing · Linear Warmup With Linear Decay · Attention Dropout