The Volctrans GLAT System: Non-autoregressive Translation Meets WMT21
Lihua Qian, Yi Zhou, Zaixiang Zheng, Yaoming Zhu, Zehui Lin, Jiangtao Feng, Shanbo Cheng, Lei Li, Mingxuan Wang, Hao Zhou

TL;DR
This paper introduces a non-autoregressive translation system based on the Glancing Transformer. It achieves the best BLEU score (35.0) on the WMT21 German->English news translation task while decoding in parallel, which is faster than autoregressive decoding.
Contribution
To the authors' knowledge, the first parallel (non-autoregressive) translation system scaled to a practical setting such as the WMT competition; it outperforms strong autoregressive systems in BLEU score.
Findings
Achieved a BLEU score of 35.0 on German->English translation
Enabled fast parallel decoding with high accuracy
Outperformed all strong autoregressive models in the WMT21 task
Abstract
This paper describes Volctrans' submission to the WMT21 news translation shared task for German->English translation. We build a parallel (i.e., non-autoregressive) translation system using the Glancing Transformer, which enables fast and accurate parallel decoding in contrast to the currently prevailing autoregressive models. To the best of our knowledge, this is the first parallel translation system that can be scaled to a practical scenario such as the WMT competition. More importantly, our parallel translation system achieves the best BLEU score (35.0) on the German->English translation task, outperforming all strong autoregressive counterparts.
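The contrast between autoregressive and parallel decoding drawn in the abstract can be illustrated with a toy sketch. This is not the Glancing Transformer or the authors' system: `toy_scorer` is a hypothetical stand-in for a trained model, and the function names and step counts are assumptions made purely for illustration. The only point is that an autoregressive decoder needs one sequential model pass per output token, whereas a parallel decoder predicts every position in a single pass.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: (source tokens, target prefix, position) -> token
Scorer = Callable[[List[str], List[str], int], str]


def toy_scorer(source: List[str], prefix: List[str], position: int) -> str:
    """Stand-in for a trained model: upper-cases the aligned source token."""
    return source[position].upper()


def autoregressive_decode(
    source: List[str], scorer: Scorer, length: int
) -> Tuple[List[str], int]:
    """Emit tokens left to right; each step conditions on the prefix so far."""
    output: List[str] = []
    sequential_passes = 0
    for position in range(length):
        output.append(scorer(source, output, position))
        sequential_passes += 1  # one model pass per generated token
    return output, sequential_passes


def parallel_decode(
    source: List[str], scorer: Scorer, length: int
) -> Tuple[List[str], int]:
    """Predict every position independently of the other target tokens."""
    # All positions are scored with an empty prefix, so on real hardware
    # they could be computed together in a single model pass.
    output = [scorer(source, [], position) for position in range(length)]
    return output, 1


source_tokens = ["guten", "morgen", "welt"]
ar_output, ar_passes = autoregressive_decode(source_tokens, toy_scorer, 3)
nar_output, nar_passes = parallel_decode(source_tokens, toy_scorer, 3)
```

In this toy both decoders return the same tokens, but the autoregressive loop needs three sequential passes and the parallel one needs a single pass. A real non-autoregressive model has to work harder to match autoregressive quality, because it cannot condition each token on the ones before it.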
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Label Smoothing · Residual Connection · Layer Normalization · Position-Wise Feed-Forward Layer · Adam
