Pay Less Attention with Lightweight and Dynamic Convolutions
Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli

TL;DR
This paper introduces dynamic convolutions, a lightweight and efficient alternative to self-attention, and reports competitive or superior results in machine translation, language modeling, and abstractive summarization.
Contribution
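It presents dynamic convolutions, which are simpler and more efficient than self-attention: the number of operations scales linearly in the input length, whereas self-attention is quadratic. The paper reports improvements over strong self-attention models on several tasks.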
Findings
Dynamic convolutions outperform self-attention in large-scale translation.
Dynamic convolutions achieve a new state of the art of 29.7 BLEU on the WMT'14 English-German test set.
Convolution-based models are competitive with or better than self-attention models.
Abstract
Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic…
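The mechanism described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the names `dynamic_conv` and `proj`, the causal zero padding, and the sharing of one kernel across the channels of a head are assumptions made for illustration. The sketch shows the two properties the abstract states: the kernel at each position is predicted from the current time step alone, and the work per position depends on the kernel width rather than on the sequence length.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_conv(x, proj, num_heads, kernel_size):
    """Illustrative dynamic convolution (hypothetical helper).

    x:    n time steps, each a list of d floats.
    proj: (num_heads * kernel_size) x d matrix mapping the current
          time step to kernel logits (assumed parameterization).
    Returns an n x d list. Causal: position t only sees t-k+1 .. t,
    and positions before the sequence start are treated as zeros.
    """
    n, d = len(x), len(x[0])
    assert d % num_heads == 0, "channels must divide evenly into heads"
    per_head = d // num_heads
    out = []
    for t in range(n):
        # Kernel logits are a function of x[t] only, not of the context.
        logits = [sum(w * v for w, v in zip(row, x[t])) for row in proj]
        y = [0.0] * d
        for h in range(num_heads):
            # One softmax-normalized kernel per head, shared by its channels.
            kernel = softmax(logits[h * kernel_size:(h + 1) * kernel_size])
            for c in range(h * per_head, (h + 1) * per_head):
                for j, w in enumerate(kernel):
                    src = t - (kernel_size - 1) + j
                    if src >= 0:
                        y[c] += w * x[src][c]
        out.append(y)
    return out


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    proj = [[0.0, 0.0], [0.0, 0.0]]  # zero logits give a uniform kernel
    print(dynamic_conv(x, proj, num_heads=1, kernel_size=2))
```

With a fixed kernel width k, the loop performs on the order of n · d · k multiply-adds for the convolution itself, so doubling the sequence length doubles the cost. Self-attention instead compares every position with every other position, which is where the quadratic term comes from.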
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: Gated Linear Unit · Linear Layer · Dynamic Convolution · DropConnect · Depthwise Convolution · Softmax · Lightweight Convolution · Convolution
