Multilingual Translation via Grafting Pre-trained Language Models
Zewei Sun, Mingxuan Wang, Lei Li

TL;DR
This paper introduces Graformer, a method to combine pre-trained language models for different languages to improve multilingual translation, leveraging monolingual and parallel data for enhanced performance.
Contribution
It proposes a novel approach to graft pre-trained language models for different languages, enabling effective multilingual translation without training from scratch.
Findings
Achieves an average improvement of 5.8 BLEU in x2en directions over a multilingual Transformer of the same size
Achieves an average improvement of 2.9 BLEU in en2x directions over the same baseline
Demonstrates effectiveness on 60 translation directions
Abstract
Can pre-trained BERT for one language and GPT for another be glued together to translate texts? Self-supervised training using only monolingual data has led to the success of pre-trained (masked) language models in many NLP tasks. However, directly connecting BERT as an encoder and GPT as a decoder can be challenging in machine translation, because GPT-like models lack the cross-attention component that seq2seq decoders need. In this paper, we propose Graformer to graft separately pre-trained (masked) language models for machine translation. With monolingual data for pre-training and parallel data for grafting training, we take maximal advantage of both types of data. Experiments on 60 directions show that our method achieves average improvements of 5.8 BLEU in x2en and 2.9 BLEU in en2x directions compared with a multilingual Transformer of the same size.
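To make the missing component concrete, the following is a minimal sketch of cross-attention, the mechanism a decoder-only language model lacks when it has to condition on encoder outputs. It is an illustration only, not Graformer's architecture: the function names `cross_attention` and `graft_step` are hypothetical, the learned query/key/value projections of a real layer are omitted, and the residual sum used to fuse the two streams is an assumption made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states):
    """Single-head scaled dot-product cross-attention.

    Each decoder state acts as a query over all encoder states, which
    serve as both keys and values. Learned projections are omitted
    (an assumption of this sketch).
    """
    d = len(encoder_states[0])
    outputs = []
    for q in decoder_states:
        # Similarity between this target position and every source position.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in encoder_states
        ]
        weights = softmax(scores)
        # Weighted average of the source representations.
        context = [
            sum(w * v[j] for w, v in zip(weights, encoder_states))
            for j in range(d)
        ]
        outputs.append(context)
    return outputs


def graft_step(decoder_states, encoder_states):
    """Hypothetical fusion: add the source context to each decoder state."""
    contexts = cross_attention(decoder_states, encoder_states)
    return [
        [h + c for h, c in zip(hs, cs)]
        for hs, cs in zip(decoder_states, contexts)
    ]


if __name__ == "__main__":
    encoder_states = [[1.0, 0.0], [0.0, 1.0]]  # two source positions
    decoder_states = [[10.0, 0.0]]             # one target position
    print(graft_step(decoder_states, encoder_states))
```

The output has one vector per decoder position, each now carrying information from the source sentence. A language model pre-trained on monolingual text alone has no such pathway, which is why the two models cannot simply be connected end to end.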
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Sigmoid Activation · Tanh Activation · Cosine Annealing · Linear Warmup With Linear Decay · Attention Dropout
