Are Pre-trained Convolutions Better than Pre-trained Transformers?
Yi Tay, Mostafa Dehghani, Jai Gupta, Dara Bahri, Vamsi Aribandi, Zhen Qin, Donald Metzler

TL;DR
This paper compares pre-trained convolutional models with pre-trained Transformers on language tasks. It finds that CNNs are competitive and, in certain scenarios, outperform Transformers, which suggests that gains from pre-training should not be credited to the Transformer architecture alone.
Contribution
It provides the first extensive empirical comparison of pre-trained CNNs and Transformers, highlighting the potential of CNNs in NLP.
Findings
CNN-based pre-trained models are competitive with Transformers.
CNNs can outperform Transformers in certain NLP scenarios.
Pre-training and architecture advances should be considered independently.
Abstract
In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy…
Videos
Are Pre-trained Convolutions Better than Pre-trained Transformers? – Paper Explained (YouTube)
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications
Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Dropout · Softmax · Layer Normalization · Label Smoothing
