The Volctrans GLAT System: Non-autoregressive Translation Meets WMT21

Lihua Qian; Yi Zhou; Zaixiang Zheng; Yaoming Zhu; Zehui Lin; Jiangtao Feng; Shanbo Cheng; Lei Li; Mingxuan Wang; Hao Zhou

arXiv:2109.11247·cs.CL·September 27, 2021·6 cites

TL;DR

This paper presents a non-autoregressive (parallel) translation system built on the Glancing Transformer. The system achieves the best BLEU score (35.0) on the WMT21 German->English news translation task and decodes all target tokens in parallel, in contrast to autoregressive models.

Contribution

To the authors' knowledge, the first non-autoregressive translation system scaled to a practical setting such as the WMT competition, where it outperforms strong autoregressive systems in BLEU score.

Findings

1. Achieved a BLEU score of 35.0 on German->English translation

2. Enabled fast parallel decoding with high accuracy

3. Outperformed all strong autoregressive models in the WMT21 task

Abstract

This paper describes Volctrans' submission to the WMT21 news translation shared task for German->English translation. We build a parallel (i.e., non-autoregressive) translation system using the Glancing Transformer, which enables fast and accurate parallel decoding, in contrast to the currently prevailing autoregressive models. To the best of our knowledge, this is the first parallel translation system that can be scaled to a practical scenario such as the WMT competition. More importantly, our parallel translation system achieves the best BLEU score (35.0) on the German->English translation task, outperforming all strong autoregressive counterparts.
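
The contrast between autoregressive and parallel decoding drawn in the abstract can be illustrated with a toy sketch. This is not the Volctrans system or the Glancing Transformer: the vocabulary, the `toy_scores` rule (which simply copies the source), and both decode functions are hypothetical names invented here, and the target length for parallel decoding is assumed to be given. The sketch only shows how many sequential model passes each decoding style needs.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy vocabulary; not taken from the paper.
VOCAB = ["<eos>", "the", "cat", "sat"]


def toy_scores(source: Sequence[str], position: int) -> List[float]:
    """Stand-in for a model: one score per vocabulary entry.

    Toy rule (an assumption for illustration): copy the source token
    at this position, or prefer <eos> once the source is exhausted.
    """
    wanted = source[position] if position < len(source) else "<eos>"
    return [1.0 if token == wanted else 0.0 for token in VOCAB]


def argmax_token(scores: List[float]) -> str:
    """Return the vocabulary entry with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return VOCAB[best]


def decode_autoregressive(
    source: Sequence[str], max_len: int
) -> Tuple[List[str], int]:
    """Emit one token per sequential pass, extending the prefix each time."""
    output: List[str] = []
    passes = 0
    for _ in range(max_len):
        passes += 1  # each new token needs its own pass
        token = argmax_token(toy_scores(source, len(output)))
        if token == "<eos>":
            break
        output.append(token)
    return output, passes


def decode_parallel(
    source: Sequence[str], target_len: int
) -> Tuple[List[str], int]:
    """Predict every target position independently.

    The comprehension stands in for one batched forward pass over all
    positions, so the number of sequential passes is 1.
    """
    output = [argmax_token(toy_scores(source, i)) for i in range(target_len)]
    return output, 1


if __name__ == "__main__":
    src = ["the", "cat", "sat"]
    print(decode_autoregressive(src, max_len=10))
    print(decode_parallel(src, target_len=len(src)))
```

In this toy setting both functions return the same tokens, but the autoregressive loop needs one pass per emitted token plus one for the end marker, while the parallel version needs a single pass. Real non-autoregressive models have to work much harder to match autoregressive quality, which is the gap the paper reports closing.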

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Label Smoothing · Residual Connection · Layer Normalization · Position-Wise Feed-Forward Layer · Adam