BERT-JAM: Boosting BERT-Enhanced Neural Machine Translation with Joint Attention
Zhebin Zhang, Sai Wu, Dawei Jiang, Gang Chen

TL;DR
BERT-JAM adds a flexible joint-attention mechanism to BERT-enhanced neural machine translation and makes use of BERT's intermediate-layer representations, reporting state-of-the-art translation performance.
Contribution
The paper presents BERT-JAM, a novel NMT model that dynamically allocates attention and leverages intermediate BERT representations, improving translation quality.
Findings
Achieves state-of-the-art BLEU scores on multiple translation tasks.
Demonstrates effective dynamic distribution of attention between the BERT representation and the encoder/decoder representation.
Utilizes a three-phase training strategy for optimal performance.
Abstract
BERT-enhanced neural machine translation (NMT) aims at leveraging BERT-encoded representations for translation tasks. A recently proposed approach uses attention mechanisms to fuse Transformer's encoder and decoder layers with BERT's last-layer representation and shows enhanced performance. However, their method doesn't allow for the flexible distribution of attention between the BERT representation and the encoder/decoder representation. In this work, we propose a novel BERT-enhanced NMT model called BERT-JAM which improves upon existing models from two aspects: 1) BERT-JAM uses joint-attention modules to allow the encoder/decoder layers to dynamically allocate attention between different representations, and 2) BERT-JAM allows the encoder/decoder layers to make use of BERT's intermediate representations by composing them using a gated linear unit (GLU). We train BERT-JAM with a novel…
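The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary above does not give the exact formulation, so the sketch assumes a single attention head with no learned query/key/value projections, and assumes the GLU is applied to a linear projection of the concatenated BERT layer outputs. The names `glu_fuse` and `joint_attention`, and all shapes, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def glu_fuse(layers, W, b):
    """Compose several BERT layer outputs into one representation (assumed form).

    layers: (num_layers, T, d) stacked hidden states.
    W, b:   projection from num_layers*d to 2*d features.
    Returns (T, d): value half gated by the sigmoid of the gate half.
    """
    num_layers, T, d = layers.shape
    stacked = layers.transpose(1, 0, 2).reshape(T, num_layers * d)
    proj = stacked @ W + b
    value, gate = proj[:, :d], proj[:, d:]
    return value * (1.0 / (1.0 + np.exp(-gate)))

def joint_attention(query, enc, bert):
    """One softmax over the concatenated encoder and BERT positions.

    Because both sources share a single normalization, each query decides
    how much attention mass goes to each source (assumed simplification).
    """
    memory = np.concatenate([enc, bert], axis=0)        # (T_enc + T_bert, d)
    scores = query @ memory.T / np.sqrt(query.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ memory, weights

# Toy inputs; every size here is an arbitrary illustration value.
rng = np.random.default_rng(0)
num_layers, t_bert, t_enc, t_q, d = 4, 6, 5, 3, 8
bert_layers = rng.normal(size=(num_layers, t_bert, d))
W = rng.normal(size=(num_layers * d, 2 * d)) * 0.1
b = np.zeros(2 * d)

bert_fused = glu_fuse(bert_layers, W, b)
enc = rng.normal(size=(t_enc, d))
query = rng.normal(size=(t_q, d))

out, weights = joint_attention(query, enc, bert_fused)
enc_share = weights[:, :t_enc].sum(axis=1)    # attention mass on encoder states
bert_share = weights[:, t_enc:].sum(axis=1)   # attention mass on BERT states
```

For each query, `enc_share` and `bert_share` sum to one, which is the sense in which a joint softmax lets attention be divided between the two representations rather than being fixed in advance.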
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Attention Dropout · Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Linear Warmup With Linear Decay
