On the Power of Convolution Augmented Transformer
Mingchen Li, Xuechen Zhang, Yixiao Huang, Samet Oymak

TL;DR
This paper introduces Convolution-Augmented Transformer (CAT), which combines convolutional filters with attention to improve recall, copying, and length generalization in language models, with theoretical guarantees and empirical validation.
Contribution
The paper presents a novel CAT architecture that provably solves associative recall and copying tasks with a single layer and guarantees length generalization, bridging local and global modeling.
Findings
CAT can solve recall and copying tasks with a single layer.
CAT guarantees length generalization in language modeling.
Empirical results show improved performance on real datasets.
Abstract
The transformer architecture has catalyzed revolutionary advances in language modeling. However, recent architectural recipes, such as state-space models, have bridged the performance gap. Motivated by this, we examine the benefits of Convolution-Augmented Transformer (CAT) for recall, copying, and length generalization tasks. CAT incorporates convolutional filters in the K/Q/V embeddings of an attention layer. Through CAT, we show that the locality of the convolution synergizes with the global view of the attention. Unlike comparable architectures, such as Mamba or transformer, CAT can provably solve the associative recall (AR) and copying tasks using a single layer while also enjoying guaranteed length generalization. We also establish computational tradeoffs between convolution and attention by characterizing how convolution can mitigate the need for full attention by summarizing the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNeural Networks and Applications · Analog and Mixed-Signal Circuit Design · Sensor Technology and Measurement Systems
MethodsLinear Layer · Multi-Head Attention · Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam
