Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation
Pei Zhang, Boxing Chen, Niyu Ge, Kai Fan

TL;DR
This paper introduces a simple long-short term masking self-attention mechanism for document-level neural machine translation that improves long-range dependency modeling, reduces error propagation, and achieves strong BLEU scores.
Contribution
It proposes a novel masking strategy on the standard transformer to enhance document-level translation without increasing model complexity.
Findings
Achieves strong BLEU scores on two publicly available document-level datasets
Effectively captures discourse phenomena
Reduces error propagation in translation
Abstract
Many document-level neural machine translation (NMT) systems have explored the utility of context-aware architectures, usually at the cost of more parameters and higher computational complexity. However, little attention has been paid to the baseline model. In this paper, we extensively study the pros and cons of the standard transformer in document-level translation and find that its auto-regressive property brings both the advantage of consistency and the disadvantage of error accumulation. Therefore, we propose a surprisingly simple long-short term masking self-attention on top of the standard transformer to both effectively capture long-range dependence and reduce the propagation of errors. We evaluate our approach on two publicly available document-level datasets, achieving strong BLEU results and capturing discourse phenomena.
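As a rough illustration of the masking idea described in the abstract, the sketch below builds two boolean attention masks over the tokens of a document. It assumes (the abstract does not spell this out) that the short-term mask restricts each token to tokens of its own sentence, while the long-term mask leaves the whole document visible. The function names and the `sentence_ids` input are hypothetical, and this is not the authors' implementation.

```python
from typing import List

Mask = List[List[bool]]

def short_term_mask(sentence_ids: List[int]) -> Mask:
    """Assumed short-term mask: position i may attend to j only if both
    tokens belong to the same sentence."""
    return [[si == sj for sj in sentence_ids] for si in sentence_ids]

def long_term_mask(sentence_ids: List[int]) -> Mask:
    """Assumed long-term mask: every position may attend to the whole document."""
    n = len(sentence_ids)
    return [[True] * n for _ in range(n)]

def causal(mask: Mask) -> Mask:
    """Add the auto-regressive constraint used on the decoder side:
    position i may not attend to any later position j > i."""
    return [
        [allowed and j <= i for j, allowed in enumerate(row)]
        for i, row in enumerate(mask)
    ]

# Toy document: tokens 0-2 form sentence 0, tokens 3-4 form sentence 1.
sentence_ids = [0, 0, 0, 1, 1]
local_mask = short_term_mask(sentence_ids)
global_mask = long_term_mask(sentence_ids)
decoder_local = causal(local_mask)
```

In a multi-head setting, one plausible design is to give some heads the long-term mask and others the short-term mask, so that document-level context and sentence-level focus are both available without adding parameters. That split is an assumption here, not a detail taken from the abstract.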
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
