Are Pre-trained Convolutions Better than Pre-trained Transformers?

Yi Tay; Mostafa Dehghani; Jai Gupta; Dara Bahri; Vamsi Aribandi; Zhen Qin; Donald Metzler

arXiv:2105.03322 · cs.CL · February 1, 2022 · 33 cites

TL;DR

This paper compares pre-trained convolutional models with pre-trained Transformers on 8 language datasets/tasks. It finds that CNNs are competitive and, in certain scenarios, outperform Transformers, which challenges the field's near-exclusive focus on Transformers.

Contribution

It provides the first extensive empirical comparison of pre-trained CNNs and Transformers, highlighting the potential of CNNs in NLP.

Findings

01. CNN-based pre-trained models are competitive with Transformers.

02. CNNs can outperform Transformers in certain NLP scenarios.

03. Pre-training and architecture advances should be considered independently.

Abstract

In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy…

Code & Models

Repositories

google-research/google-research · tf · Official

Videos

Are Pre-trained Convolutions Better than Pre-trained Transformers? – Paper Explained · YouTube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications

Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Dropout · Softmax · Layer Normalization · Label Smoothing