Convolutional Sequence to Sequence Learning

Jonas Gehring; Michael Auli; David Grangier; Denis Yarats; Yann N. Dauphin

arXiv:1705.03122·cs.CL·July 26, 2017·1.9k cites

TL;DR

This paper introduces a sequence to sequence architecture based entirely on convolutional neural networks. Computation over all sequence elements can be fully parallelized during training, optimization is easier than with recurrent models, and translation accuracy exceeds that of a deep LSTM baseline at an order of magnitude faster speed.

Contribution

The authors present a fully convolutional architecture that uses gated linear units and equips each decoder layer with a separate attention module. It outperforms the deep LSTM setup of Wu et al. (2016) on WMT'14 translation.

Findings

01

Outperforms the deep LSTM setup of Wu et al. (2016) on WMT'14 English-German and WMT'14 English-French translation

02

Fully parallelizes computation over all sequence elements during training

03

Runs an order of magnitude faster than the LSTM baseline, on both GPU and CPU

Abstract

The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
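To make the abstract's two central ideas concrete, the following is a minimal sketch of one convolutional layer followed by a gated linear unit, written in plain Python. It is not the authors' implementation. The function names (`sigmoid`, `glu`, `conv_glu_layer`), the symmetric zero padding, the odd kernel width, and the toy weights are all assumptions made for illustration. The sketch shows that each output position depends only on a fixed-width window of inputs, so positions can be computed independently of one another, and that one layer applies exactly one non-linearity per position regardless of sequence length.

```python
import math


def sigmoid(x):
    """Logistic function used as the gate in a gated linear unit."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(values, gates):
    """Gated linear unit: elementwise values * sigmoid(gates)."""
    return [v * sigmoid(g) for v, g in zip(values, gates)]


def conv_glu_layer(x, weights, bias, k):
    """Apply a width-k convolution followed by a GLU (hypothetical helper).

    x       : list of T input vectors, each of dimension d
    weights : 2*d rows, each of length k*d (first d rows produce values,
              last d rows produce gates)
    bias    : list of 2*d biases
    k       : kernel width, assumed odd so symmetric padding keeps length T
    """
    d = len(x[0])
    pad = k // 2
    zero = [0.0] * d
    # Assumption: zero-pad both ends so the output has the same length as x.
    padded = [zero] * pad + x + [zero] * pad

    out = []
    # Each iteration reads only a fixed window of the input, so the
    # iterations are independent and could run in parallel.
    for t in range(len(x)):
        window = [v for vec in padded[t:t + k] for v in vec]
        y = [sum(w * v for w, v in zip(row, window)) + b
             for row, b in zip(weights, bias)]
        # Split the 2*d outputs into values and gates, then gate.
        out.append(glu(y[:d], y[d:]))
    return out


# Toy usage: three time steps, dimension 1, kernel width 3.
x = [[1.0], [2.0], [3.0]]
weights = [[1.0, 1.0, 1.0],   # value row: sums the window
           [0.0, 0.0, 0.0]]   # gate row: always 0, so the gate is 0.5
bias = [0.0, 0.0]
print(conv_glu_layer(x, weights, bias, k=3))
```

Stacking several such layers widens the range of input positions each output can see, while the number of non-linearities on the path from input to output stays equal to the number of layers. In a recurrent model that count grows with the sequence length.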

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Algorithms and Data Compression

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory