Non-Autoregressive Machine Translation with Disentangled Context Transformer
Jungo Kasai, James Cross, Marjan Ghazvininejad, Jiatao Gu

TL;DR
This paper introduces the DisCo transformer, a non-autoregressive machine translation model that generates all tokens simultaneously using disentangled contexts, leading to faster inference with competitive translation quality.
Contribution
The paper proposes a novel attention-masking model and inference algorithm for non-autoregressive translation, enabling parallel token generation and improved decoding speed.
Findings
Achieves translation quality comparable to, or better than, that of autoregressive models.
Significantly reduces decoding time across multiple translation tasks.
Demonstrates effectiveness on 7 translation directions with various data sizes.
Abstract
State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens. The sequential nature of this generation process causes fundamental latency in inference since we cannot generate multiple tokens in each sentence in parallel. We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts. The DisCo transformer is trained to predict every output token given an arbitrary subset of the other reference tokens. We also develop the parallel easy-first inference algorithm, which iteratively refines every token in parallel and reduces the number of required iterations. Our extensive experiments on 7 translation directions with varying data sizes demonstrate that our model achieves competitive, if not better,…
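To make the masking idea concrete, here is a minimal sketch in plain Python of what "predict every output token given an arbitrary subset of the other reference tokens" could look like as an attention mask. It is an illustration under stated assumptions, not the authors' implementation: the function names `sample_context_mask` and `easy_first_mask` are hypothetical, the uniform sampling of subset sizes is an assumption, and the confidence-based ordering in the second function is only one plausible reading of "easy-first" that the abstract does not spell out.

```python
import random


def sample_context_mask(length, rng):
    """Return a length x length boolean matrix.

    mask[n][m] is True when position n may attend to target token m.
    Each position gets its own random subset of the *other* positions,
    and never sees itself (assumed sampling scheme, for illustration).
    """
    mask = [[False] * length for _ in range(length)]
    for n in range(length):
        others = [m for m in range(length) if m != n]
        # Assumption: pick the subset size uniformly, then the members.
        k = rng.randint(0, len(others))  # both ends inclusive
        for m in rng.sample(others, k):
            mask[n][m] = True
    return mask


def easy_first_mask(confidence):
    """Hypothetical inference-time mask built from per-token confidences.

    Position n may attend only to positions ranked as "easier"
    (higher confidence) than itself, so every position can still be
    updated in the same parallel pass.
    """
    size = len(confidence)
    order = sorted(range(size), key=lambda i: -confidence[i])  # easiest first
    rank = [0] * size
    for r, i in enumerate(order):
        rank[i] = r
    return [[rank[m] < rank[n] for m in range(size)] for n in range(size)]


if __name__ == "__main__":
    rng = random.Random(0)
    for row in sample_context_mask(5, rng):
        print("".join("x" if visible else "." for visible in row))
    print(easy_first_mask([0.9, 0.2, 0.5]))
```

In both masks the diagonal stays empty, which is the point of disentangling the contexts: no position is allowed to observe the token it is asked to predict, so all positions can be scored in a single forward pass.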
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
