Towards Making the Most of BERT in Neural Machine Translation

Jiacheng Yang; Mingxuan Wang; Hao Zhou; Chengqi Zhao; Yong Yu; Weinan Zhang; Lei Li

arXiv:1908.05672·cs.CL·June 22, 2022·31 cites

TL;DR

This paper introduces CTNMT, a training framework that effectively integrates pre-trained language models like BERT into neural machine translation, improving translation quality and surpassing previous state-of-the-art results.

Contribution

The paper proposes a novel concerted training framework with techniques to retain pre-trained knowledge and prevent catastrophic forgetting in NMT models.

Findings

01

Up to 3 BLEU score improvement on WMT14 English-German

02

Surpasses previous state-of-the-art pre-training aided NMT by 1.4 BLEU

03

Significant improvements on large-scale English-French translation

Abstract

GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (CTNMT) that is the key to integrating pre-trained LMs into neural machine translation (NMT). Our proposed CTNMT consists of three techniques: a) asymptotic distillation to ensure that the NMT model can retain the previous pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our experiments in machine translation show that CTNMT gains up to 3 BLEU on the WMT14 English-German language pair, which even surpasses the previous state-of-the-art pre-training aided…
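
To make techniques (a) and (b) more concrete, here is a minimal, dependency-free sketch of how a switching gate and a distillation penalty might look. It is an illustration under stated assumptions, not the paper's implementation: the function names, the per-dimension (diagonal) gate weights, the mean-squared-error form of the penalty, and the `alpha` mixing weight are all hypothetical choices made for this example. The abstract does not specify any of them.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def switching_gate(h_lm, h_nmt, w_lm, w_nmt, bias):
    """Fuse a pre-trained LM state with an NMT encoder state.

    Assumption: one gate value per dimension,
        g = sigmoid(w_lm * h_lm + w_nmt * h_nmt + bias),
    and the fused state is g * h_lm + (1 - g) * h_nmt.
    All arguments are equal-length lists of floats.
    """
    fused = []
    for a, b, wa, wb, c in zip(h_lm, h_nmt, w_lm, w_nmt, bias):
        g = sigmoid(wa * a + wb * b + c)
        fused.append(g * a + (1.0 - g) * b)
    return fused


def distillation_penalty(h_lm, h_nmt):
    """Assumed penalty: mean squared distance between the two states."""
    return sum((a - b) ** 2 for a, b in zip(h_lm, h_nmt)) / len(h_lm)


def combined_loss(nmt_loss, h_lm, h_nmt, alpha=0.9):
    """Mix the translation loss with the penalty.

    alpha is a hypothetical placeholder weight, not a reported value.
    """
    return alpha * nmt_loss + (1.0 - alpha) * distillation_penalty(h_lm, h_nmt)


if __name__ == "__main__":
    h_lm, h_nmt = [1.0, 3.0], [3.0, 1.0]
    zeros = [0.0, 0.0]
    # With zero weights and bias the gate is 0.5, so the states are averaged.
    print(switching_gate(h_lm, h_nmt, zeros, zeros, zeros))
    print(combined_loss(1.0, h_lm, h_nmt, alpha=0.5))
```

In this sketch the gate lets each dimension lean on the pre-trained state or the NMT state, while the penalty keeps the NMT state from drifting far from the pre-trained one during training. Technique (c), the scheduled learning paces, is not sketched because the abstract gives no detail on the schedule.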

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding · Dense Connections