Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models
Tyler A. Chang, Yifan Xu, Weijian Xu, and Zhuowen Tu

TL;DR
This paper reveals the equivalence between relative position embeddings in self-attention and dynamic lightweight convolutions, proposing new convolutional methods to enhance Transformer models for natural language processing tasks.
Contribution
It introduces composite attention unifying relative position embeddings with convolutions and demonstrates their effectiveness in improving BERT's performance across tasks.
Findings
Convolutions improve downstream task performance.
Relative position embeddings are equivalent to dynamic lightweight convolutions.
Different convolution types and injection points affect model pre-training.
Abstract
In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites previous relative position embedding methods under a convolutional framework. We conduct experiments by training BERT with composite attention, finding that convolutions consistently improve performance on multiple downstream tasks, replacing absolute position embeddings. To inform future work, we present results comparing lightweight convolutions, dynamic convolutions, and depthwise-separable convolutions in language model pre-training, considering multiple injection points for convolutions in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Refunds@Expedia|||How do I get a full refund from Expedia? · Linear Warmup With Linear Decay · Residual Connection
