TL;DR
This paper augments neural machine translation (NMT) with dense, fine-tuned vector representations extracted from BERT instead of point estimates, which improves generalization without making training harder than for a conventional Transformer-based model.
Contribution
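It introduces an approach for incorporating fine-tuned BERT embeddings into NMT as dense linguistic vectors, improving generalization while avoiding the brittle, unreliable training associated with fine-tuning BERT directly inside an NMT model.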
Findings
Improved translation performance across various datasets.
Enhanced generalization in low-resource settings.
Training is no more difficult than for a conventional Transformer-based NMT model.
Abstract
Adding linguistic information (syntax or semantics) to neural machine translation (NMT) has mostly focused on using point estimates from pre-trained models. Directly using the capacity of massive pre-trained contextual word embedding models such as BERT (Devlin et al., 2019) has been marginally useful in NMT because effective fine-tuning is difficult to obtain for NMT without making training brittle and unreliable. We augment NMT by extracting dense fine-tuned vector-based linguistic information from BERT instead of using point estimates. Experimental results show that our method of incorporating linguistic information helps NMT to generalize better in a variety of training contexts and is no more difficult to train than conventional Transformer-based NMT.
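The contrast the abstract draws between point estimates and dense vectors can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names, the toy numbers, and the choice of element-wise addition to combine features are hypothetical and are not taken from the paper. In a real system the dense vector would come from a fine-tuned BERT model rather than a hand-written list.

```python
# Illustrative sketch only: hypothetical helpers and toy values,
# not the paper's implementation.

def argmax(xs):
    """Index of the largest value in xs."""
    return max(range(len(xs)), key=lambda i: xs[i])

def point_estimate_feature(tag_scores, tag_embeddings):
    """Point estimate: keep only the single best label and look up its embedding.
    All information about the other labels' scores is discarded."""
    return tag_embeddings[argmax(tag_scores)]

def dense_feature(hidden_state, projection):
    """Dense alternative: linearly project the full contextual vector
    (a stand-in for a BERT-derived representation) to the model dimension."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in projection]

def augment(token_embedding, feature):
    """Combine by element-wise addition (one possible choice, assumed here)."""
    assert len(token_embedding) == len(feature)
    return [t + f for t, f in zip(token_embedding, feature)]

# Toy inputs for a single source token.
token_embedding = [0.5, 0.25]
tag_scores = [0.1, 0.7, 0.2]                      # scores over 3 labels
tag_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
hidden_state = [1.0, 0.0, -1.0]                   # stand-in dense vector
projection = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # 3 -> 2 dimensions

with_point_estimate = augment(
    token_embedding, point_estimate_feature(tag_scores, tag_embeddings))
with_dense_vector = augment(
    token_embedding, dense_feature(hidden_state, projection))
```

The point-estimate path reduces the linguistic signal to one discrete label before the translation model sees it, whereas the dense path passes a continuous vector through, which is the distinction the abstract emphasizes.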
Taxonomy
Methods: Linear Layer · Weight Decay · WordPiece · Linear Warmup With Linear Decay · Residual Connection · Layer Normalization · Adam · Dropout · Multi-Head Attention · Attention Dropout
