Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision
Chenyang Huang, Hao Zhou, Osmar R. Zaïane, Lili Mou, Lei Li

TL;DR
This paper introduces DSLP, a non-autoregressive translation model with deep supervision and layer-wise predictions, achieving high translation quality and efficiency, even surpassing autoregressive models on some tasks.
Contribution
The paper proposes DSLP, a non-autoregressive Transformer that is trained with deep supervision and fed additional layer-wise predictions. It improves translation quality over the respective base models while keeping the fast inference of non-autoregressive decoding.
Findings
Consistently improves BLEU scores over the respective base models on four translation tasks (both directions of WMT'14 EN-DE and WMT'16 EN-RO).
Achieves 14.8 times faster inference than autoregressive models.
The best variant outperforms the autoregressive model on three of the four translation tasks.
Abstract
How do we perform efficient inference while retaining high translation quality? Existing neural machine translation models, such as Transformer, achieve high performance, but they decode words one by one, which is inefficient. Recent non-autoregressive translation models speed up the inference, but their quality is still inferior. In this work, we propose DSLP, a highly efficient and high-performance model for machine translation. The key insight is to train a non-autoregressive Transformer with Deep Supervision and feed additional Layer-wise Predictions. We conducted extensive experiments on four translation tasks (both directions of WMT'14 EN-DE and WMT'16 EN-RO). Results show that our approach consistently improves the BLEU scores compared with respective base models. Specifically, our best variant outperforms the autoregressive model on three translation tasks, while being 14.8…
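The two ideas named in the abstract, deep supervision and layer-wise predictions, can be illustrated with a toy sketch. This is not the authors' implementation: the function name `dslp_forward`, the position-wise `tanh` layer standing in for a Transformer decoder layer, the shared output projection, mixing the predicted token's embedding back in by addition, and averaging the per-layer losses are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def dslp_forward(hidden, targets, layers, out_proj, embed):
    """Toy forward pass; returns (mean loss, per-layer losses, last-layer predictions).

    hidden:   one vector per target position (all positions are handled in
              parallel, i.e. non-autoregressively)
    layers:   one square weight matrix per toy decoder layer (hypothetical
              stand-in for a real Transformer layer)
    out_proj: vocab x dim projection, assumed shared across layers
    embed:    vocab x dim token embedding table
    """
    layer_losses, preds = [], []
    for weight in layers:
        # Toy decoder layer: position-wise residual update.
        hidden = [[a + math.tanh(b) for a, b in zip(h, matvec(weight, h))]
                  for h in hidden]
        # Layer-wise prediction: every layer predicts the output tokens.
        probs = [softmax(matvec(out_proj, h)) for h in hidden]
        preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
        # Deep supervision: every layer gets its own cross-entropy loss.
        nll = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
        layer_losses.append(nll)
        # Feed this layer's prediction to the next layer (assumed: by addition).
        hidden = [[a + e for a, e in zip(h, embed[y])]
                  for h, y in zip(hidden, preds)]
    return sum(layer_losses) / len(layer_losses), layer_losses, preds


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, vocab, length, depth = 4, 6, 3, 2
    layers = [random_matrix(dim, dim, rng) for _ in range(depth)]
    out_proj = random_matrix(vocab, dim, rng)
    embed = random_matrix(vocab, dim, rng)
    hidden = random_matrix(length, dim, rng)
    loss, per_layer, preds = dslp_forward(hidden, [1, 4, 2], layers, out_proj, embed)
    print(loss, per_layer, preds)
```

In this sketch the training signal reaches every layer directly instead of only the last one, and each layer sees what the layer below predicted. At inference only the last layer's predictions would be used.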
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Softmax · Residual Connection · Adam · Label Smoothing · Byte Pair Encoding
