Token Drop mechanism for Neural Machine Translation

Huaao Zhang; Shigui Qiu; Xiangyu Duan; Min Zhang

arXiv:2010.11018 · cs.CL · October 22, 2020

TL;DR

This paper introduces Token Drop, a novel regularization technique for neural machine translation that replaces dropped tokens with a special token and uses self-supervised objectives to enhance model generalization.

Contribution

The paper proposes Token Drop and two self-supervised objectives to improve NMT generalization and reduce overfitting, demonstrating effectiveness on multiple language benchmarks.

Findings

01. Significant improvements over a strong Transformer baseline.

02. Enhanced model robustness to unfamiliar inputs.

03. Effective on Chinese-English and English-Romanian translation benchmarks.

Abstract

Neural machine translation (NMT) models with millions of parameters are vulnerable to unfamiliar inputs. We propose Token Drop to improve generalization and avoid overfitting in NMT models. The method is similar to word dropout, except that we replace each dropped token with a special token instead of setting the word to zero. We further introduce two self-supervised objectives: Replaced Token Detection and Dropped Token Prediction. Our method forces the model to generate the target translation from less information, so that it learns better textual representations. Experiments on Chinese-English and English-Romanian benchmarks demonstrate the effectiveness of our approach, and our model achieves significant improvements over a strong Transformer baseline.
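The replacement step described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the special token name `<drop>`, the per-token independent drop probability `p`, and the helper names `token_drop` and `prediction_targets` are all hypothetical. The returned mask and targets show one plausible way to derive labels for a detection-style and a prediction-style objective; the abstract does not spell out how the paper defines them.

```python
import random
from typing import List, Tuple

# Hypothetical name for the special token; the abstract does not specify one.
DROP_TOKEN = "<drop>"


def token_drop(
    tokens: List[str], p: float, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Replace each token with DROP_TOKEN independently with probability p.

    Returns the corrupted sequence and a 0/1 mask marking dropped positions.
    Assumption: drops are sampled independently per token.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    corrupted: List[str] = []
    mask: List[int] = []
    for tok in tokens:
        dropped = rng.random() < p  # rng.random() lies in [0.0, 1.0)
        corrupted.append(DROP_TOKEN if dropped else tok)
        mask.append(1 if dropped else 0)
    return corrupted, mask


def prediction_targets(tokens: List[str], mask: List[int]) -> List[Tuple[int, str]]:
    """Collect (position, original token) pairs for the dropped positions.

    Assumption: a prediction-style objective would recover these tokens.
    """
    return [(i, tok) for i, (tok, m) in enumerate(zip(tokens, mask)) if m]


if __name__ == "__main__":
    source = ["we", "propose", "token", "drop"]
    corrupted, mask = token_drop(source, p=0.3, rng=random.Random(0))
    print(corrupted, mask, prediction_targets(source, mask))
```

Unlike zeroing a word, the sequence length and positions are unchanged here; the model simply sees a placeholder where the original token was, which is what makes the mask usable as a per-position label.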

Code & Models

Repositories

zhajiahe/Token_Drop (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax · Adam · Layer Normalization · Dense Connections · Multi-Head Attention · Label Smoothing