Better Neural Machine Translation by Extracting Linguistic Information from BERT

Hassan S. Shavarani; Anoop Sarkar

arXiv:2104.02831 · cs.CL · April 8, 2021

TL;DR

This paper proposes a method to enhance neural machine translation by extracting dense linguistic information from BERT, leading to better generalization without complicating training.

Contribution

It introduces a novel approach to incorporate fine-tuned BERT embeddings into NMT, improving translation quality and training stability.

Findings

01. Improved translation performance across various datasets.

02. Enhanced generalization in low-resource settings.

03. No additional training complexity introduced.

Abstract

Adding linguistic information (syntax or semantics) to neural machine translation (NMT) has mostly focused on using point estimates from pre-trained models. Directly using the capacity of massive pre-trained contextual word embedding models such as BERT (Devlin et al., 2019) has been marginally useful in NMT because effective fine-tuning is difficult to obtain for NMT without making training brittle and unreliable. We augment NMT by extracting dense fine-tuned vector-based linguistic information from BERT instead of using point estimates. Experimental results show that our method of incorporating linguistic information helps NMT to generalize better in a variety of training contexts and is no more difficult to train than conventional Transformer-based NMT.
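
The contrast the abstract draws between point estimates and dense vectors can be illustrated with a small sketch. This is not the authors' implementation: the function names, the softmax-weighted layer mix, and the concatenation step are assumptions chosen for illustration, and the toy numbers stand in for real tagger scores and BERT hidden states.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def point_estimate(tag_scores):
    """Point estimate: keep only the single best label as a one-hot vector.

    Everything the tagger knew about the runner-up labels is discarded.
    """
    best = max(range(len(tag_scores)), key=lambda i: tag_scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(tag_scores))]


def dense_features(layer_vectors, layer_logits):
    """Dense alternative (assumed form): a softmax-weighted mix of the
    per-layer hidden vectors for one token, keeping the full vector."""
    weights = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]


def augment(token_embedding, linguistic_vector):
    """Hypothetical combination step: concatenate the NMT token embedding
    with the extracted linguistic vector before the encoder."""
    return list(token_embedding) + list(linguistic_vector)


# Toy scores over three hypothetical tags (NOUN, VERB, ADJ): the tagger is
# nearly undecided between the first two, but the one-hot output hides that.
hard = point_estimate([2.0, 1.9, 0.1])

# Two toy "layers" of 2-dimensional hidden states with equal mixing logits.
mixed = dense_features([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])

encoder_input = augment([0.1, 0.2], mixed)
```

In this toy case `hard` records only the winning tag, while `mixed` retains a graded vector that a downstream encoder can weigh as it sees fit. In a real system the layer vectors would come from a fine-tuned BERT model and the mixing logits would be learned.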

Code & Models

Repositories

sfu-natlang/SFUTranslate
PyTorch · Official

Taxonomy

Methods

Linear Layer · Weight Decay · WordPiece · Linear Warmup With Linear Decay · Residual Connection · Layer Normalization · Adam · Dropout · Multi-Head Attention · Attention Dropout