Optimizing Non-Autoregressive Transformers with Contrastive Learning
Chenxin An, Jiangtao Feng, Fei Huang, Xipeng Qiu, Lingpeng Kong

TL;DR
This paper introduces a contrastive learning objective for non-autoregressive Transformers that trains on samples from the model distribution rather than the data distribution, stabilizing training and improving performance on machine translation, summarization, and paraphrasing benchmarks.
Contribution
It proposes a contrastive learning objective, integrated with the DA-Transformer NAT architecture, to ease the difficulty of learning multi-modal data distributions.
Findings
Outperforms previous NAT baselines significantly.
Achieves new state-of-the-art results on all benchmarks.
Effective across machine translation, summarization, and paraphrasing.
Abstract
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order. They have achieved remarkable progress in machine translation as well as many other applications. However, a long-standing challenge for NATs is the learning of multi-modality data distribution, which is the main cause of the performance gap between NATs and ATs. In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution. We derive contrastive constraints to stabilize the training process and integrate this resulting objective with the state-of-the-art NAT architecture DA-Transformer. Our model is examined on 3 different tasks, including machine translation, text summarization, and paraphrasing with 5 benchmarks. Results show…
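To make the idea of "contrastive constraints" over model samples more concrete, here is a minimal sketch of a generic pairwise ranking loss. It is an illustration only and is not the paper's actual objective: the function name `pairwise_ranking_loss`, the `margin` parameter, and the use of a per-hypothesis quality score (for example, similarity to the reference) are all assumptions introduced for this example.

```python
from itertools import combinations


def pairwise_ranking_loss(log_probs, scores, margin=0.1):
    """Hinge-style ranking loss over hypotheses sampled from a model.

    log_probs: model log-probability of each sampled hypothesis.
    scores:    quality score of each hypothesis (hypothetical, e.g.
               similarity to the reference); higher is better.
    margin:    assumed base margin, scaled by the rank difference.
    """
    if len(log_probs) != len(scores):
        raise ValueError("log_probs and scores must have the same length")

    # Rank hypotheses from best to worst by quality score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    loss = 0.0
    for a, b in combinations(range(len(order)), 2):
        better, worse = order[a], order[b]
        if scores[better] == scores[worse]:
            continue  # ties impose no constraint
        # A better hypothesis should be more likely than a worse one by a
        # gap that grows with how far apart the two are in the ranking.
        required_gap = margin * (b - a)
        loss += max(0.0, required_gap - (log_probs[better] - log_probs[worse]))
    return loss


if __name__ == "__main__":
    # Likelihoods that agree with the quality ranking incur no penalty.
    print(pairwise_ranking_loss([-1.0, -2.0, -3.0], [0.9, 0.5, 0.1]))
    # Likelihoods that invert the quality ranking are penalized.
    print(pairwise_ranking_loss([-3.0, -2.0, -1.0], [0.9, 0.5, 0.1]))
```

The loss is zero when the model already assigns higher likelihood to better samples, so only mis-ranked pairs contribute a training signal. In a real training setup the log-probabilities would be differentiable tensors rather than plain floats.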
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
