Diformer: Directional Transformer for Neural Machine Translation
Minghan Wang, Jiaxin Guo, Yuxia Wang, Daimeng Wei, Hengchao Shang, Chang Su, Yimeng Chen, Yinglu Li, Min Zhang, Shimin Tao, Hao Yang

TL;DR
Diformer introduces a unified directional transformer model that jointly captures autoregressive and non-autoregressive translation directions, improving performance and latency in neural machine translation.
Contribution
It proposes a framework that preserves the original training objectives of AR and NAR models by modeling three generation directions (left-to-right, right-to-left and straight) with a direction variable.
Findings
Outperforms existing unified models by over 1.5 BLEU points on WMT benchmarks.
Achieves competitive results with state-of-the-art independent AR and NAR models.
Effectively combines AR and NAR advantages in a single model.
Abstract
Autoregressive (AR) and non-autoregressive (NAR) models each have their own strengths in performance and latency, so combining them into one model may take advantage of both. Current combination frameworks focus more on integrating multiple decoding paradigms with a unified generative model, e.g., a masked language model. However, this generalization can hurt performance because of the gap between the training objective and inference. In this paper, we aim to close the gap by preserving the original objectives of AR and NAR under a unified framework. Specifically, we propose the Directional Transformer (Diformer), which jointly models AR and NAR as three generation directions (left-to-right, right-to-left and straight) with a newly introduced direction variable. The variable works by controlling the prediction of each token so that it has specific dependencies under that direction. The…
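The idea that a direction variable controls which target tokens a prediction may depend on can be illustrated with a small sketch. This is not the authors' implementation: the function name dependency_mask, the direction labels, and the simplification that every position in a sentence shares one direction are all assumptions made here for illustration. It only shows the dependency pattern that each of the three directions implies on the target side.

```python
from typing import List

# Hypothetical labels for the three generation directions named in the abstract.
L2R, R2L, STRAIGHT = "l2r", "r2l", "straight"


def dependency_mask(length: int, direction: str) -> List[List[bool]]:
    """Return mask[i][j], True when predicting target position i
    may condition on the target token at position j.

    Assumption: the whole sentence uses a single direction.
    """
    if direction == L2R:
        # Left-to-right AR: position i sees only earlier positions.
        return [[j < i for j in range(length)] for i in range(length)]
    if direction == R2L:
        # Right-to-left AR: position i sees only later positions.
        return [[j > i for j in range(length)] for i in range(length)]
    if direction == STRAIGHT:
        # NAR ("straight"): no dependency on other target tokens.
        return [[False] * length for _ in range(length)]
    raise ValueError(f"unknown direction: {direction}")


if __name__ == "__main__":
    for d in (L2R, R2L, STRAIGHT):
        print(d)
        for row in dependency_mask(4, d):
            print("  ", "".join("x" if allowed else "." for allowed in row))
```

In this sketch, the source sentence is assumed to be fully visible in every direction, so only target-side dependencies change with the direction variable.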
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Absolute Position Encodings · Residual Connection · Dropout
