BERT-JAM: Boosting BERT-Enhanced Neural Machine Translation with Joint Attention

Zhebin Zhang; Sai Wu; Dawei Jiang; Gang Chen

arXiv:2011.04266 · cs.CL · November 10, 2020 · 1 citation

TL;DR

BERT-JAM adds a flexible joint-attention mechanism to BERT-enhanced neural machine translation and lets the translation model use BERT's intermediate representations, leading to state-of-the-art translation performance.

Contribution

The paper presents BERT-JAM, a novel NMT model whose encoder/decoder layers dynamically allocate attention between the BERT representation and their own representation and draw on BERT's intermediate representations, improving translation quality.

Findings

01 Achieves state-of-the-art BLEU scores on multiple translation tasks.

02 Demonstrates effective dynamic attention distribution between representations.

03 Utilizes a three-phase training strategy for optimal performance.

Abstract

BERT-enhanced neural machine translation (NMT) aims at leveraging BERT-encoded representations for translation tasks. A recently proposed approach uses attention mechanisms to fuse the Transformer's encoder and decoder layers with BERT's last-layer representation and shows enhanced performance. However, that method does not allow for the flexible distribution of attention between the BERT representation and the encoder/decoder representation. In this work, we propose a novel BERT-enhanced NMT model called BERT-JAM, which improves upon existing models in two respects: 1) BERT-JAM uses joint-attention modules to allow the encoder/decoder layers to dynamically allocate attention between different representations, and 2) BERT-JAM allows the encoder/decoder layers to make use of BERT's intermediate representations by composing them using a gated linear unit (GLU). We train BERT-JAM with a novel…
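The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, shapes, single-head attention, and the specific way layers are concatenated before the GLU are all assumptions made for illustration. It shows (a) several BERT layer outputs being composed into one representation by a gated linear unit, and (b) one softmax taken over the concatenation of the layer's own representation and the BERT representation, so the attention mass given to each source can vary per query.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def glu_compose(layer_reprs, w, b):
    """Compose layer outputs of shape (L, n, d) into (n, d) with a GLU.

    Assumption: layers are concatenated along the feature axis, projected
    to 2*d features, and gated as value * sigmoid(gate).
    """
    num_layers, n, d = layer_reprs.shape
    stacked = layer_reprs.transpose(1, 0, 2).reshape(n, num_layers * d)
    projected = stacked @ w + b                      # (n, 2*d)
    value, gate = projected[:, :d], projected[:, d:]
    return value * (1.0 / (1.0 + np.exp(-gate)))

def joint_attention(query, own_repr, bert_repr):
    """Single-head attention over the concatenation of two memories.

    Because one softmax spans both memories, the share of attention
    given to the BERT representation is decided per query position.
    """
    memory = np.concatenate([own_repr, bert_repr], axis=0)   # (n_own + n_bert, d)
    scores = query @ memory.T / np.sqrt(query.shape[-1])     # (n_q, n_own + n_bert)
    weights = softmax(scores, axis=-1)
    output = weights @ memory                                # (n_q, d)
    bert_share = weights[:, own_repr.shape[0]:].sum(axis=-1) # (n_q,)
    return output, bert_share

# Hypothetical toy sizes: 4 BERT layers, 6 BERT tokens, 5 encoder tokens, d = 8.
rng = np.random.default_rng(0)
layers = rng.normal(size=(4, 6, 8))
w, b = rng.normal(size=(4 * 8, 2 * 8)), np.zeros(2 * 8)
bert = glu_compose(layers, w, b)
out, share = joint_attention(rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), bert)
print(out.shape, share.round(3))
```

The `bert_share` value is only a diagnostic: it reports, for each query position, how much of the attention weight fell on the BERT side of the concatenated memory.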

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Linear Layer · Attention Dropout · Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Linear Warmup With Linear Decay