Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

Tyler A. Chang, Yifan Xu, Weijian Xu, and Zhuowen Tu

arXiv:2106.05505 · cs.CL · June 11, 2021 · 1 citation

Open Access · 1 repository

TL;DR

This paper reveals the equivalence between relative position embeddings in self-attention and dynamic lightweight convolutions, proposing new convolutional methods to enhance Transformer models for natural language processing tasks.

Contribution

It introduces composite attention unifying relative position embeddings with convolutions and demonstrates their effectiveness in improving BERT's performance across tasks.

Findings

01. Convolutions improve downstream task performance.
02. Relative position embeddings are equivalent to dynamic lightweight convolutions.
03. Different convolution types and injection points affect model pre-training.

Abstract

In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites previous relative position embedding methods under a convolutional framework. We conduct experiments by training BERT with composite attention, finding that convolutions consistently improve performance on multiple downstream tasks, replacing absolute position embeddings. To inform future work, we present results comparing lightweight convolutions, dynamic convolutions, and depthwise-separable convolutions in language model pre-training, considering multiple injection points for convolutions in…
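The equivalence claimed in the abstract can be checked numerically on a toy case. The sketch below is not taken from the paper or its repository. It assumes a single attention head, relative position embeddings added to the keys only, offsets clipped to a window of radius `k`, no content (query-key) term, and no scaling by the square root of the dimension. The function names, the `rel_emb` table, and the edge handling (normalizing only over in-range positions) are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_only_attention(q, v, rel_emb, k):
    """Self-attention whose scores are only the relative position terms q_i . r_{j-i}.

    Assumption: offsets outside [-k, k] are masked out rather than clipped.
    """
    n = q.shape[0]
    scores = np.full((n, n), -np.inf)
    for i in range(n):
        for j in range(n):
            offset = j - i
            if abs(offset) <= k:
                scores[i, j] = q[i] @ rel_emb[offset + k]
    return softmax(scores, axis=-1) @ v

def dynamic_lightweight_conv(q, v, rel_emb, k):
    """Convolution over v with a per-position kernel that is linear in q_i.

    The kernel is softmax-normalized over its in-range taps and shared
    across all channels of v (the "lightweight" part).
    """
    n = q.shape[0]
    kernels = q @ rel_emb.T  # shape (n, 2k + 1): one kernel per position
    out = np.zeros_like(v)
    for i in range(n):
        offsets = [o for o in range(-k, k + 1) if 0 <= i + o < n]
        weights = softmax(np.array([kernels[i, o + k] for o in offsets]))
        for w, o in zip(weights, offsets):
            out[i] += w * v[i + o]
    return out

# Toy inputs (random, for illustration only).
rng = np.random.default_rng(0)
n, d, k = 6, 4, 2
q = rng.normal(size=(n, d))            # queries, e.g. x_i W_Q
v = rng.normal(size=(n, d))            # values
rel_emb = rng.normal(size=(2 * k + 1, d))  # one embedding per offset in [-k, k]

attn_out = position_only_attention(q, v, rel_emb, k)
conv_out = dynamic_lightweight_conv(q, v, rel_emb, k)
print(np.allclose(attn_out, conv_out))
```

Under these assumptions the two functions compute the same weighted sum: the attention row for position i is exactly a softmax-normalized kernel over offsets j - i, generated from the query at position i. Adding the usual query-key content term back into the scores gives ordinary self-attention with relative position embeddings.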

Code & Models

Repositories

mlpc-ucsd/BERT_Convolutions (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Linear Warmup With Linear Decay · Residual Connection