Wide Attention Is The Way Forward For Transformers?
Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins

TL;DR
This paper demonstrates that wide, shallow Transformer models can match or outperform deeper ones on NLP tasks when both are trained from scratch, with benefits in speed, memory, and interpretability, challenging the traditional emphasis on depth.
Contribution
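It introduces and systematically evaluates wide, shallow Transformer architectures as an alternative to deep models in NLP, varying the number of layers against the number of attention heads per layer while keeping the total number of attention heads constant.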
Findings
Single layer wide models perform 0.3% better than their deep counterparts on average, across 4 NLP tasks and 10 attention types.
Single layer Transformers are 3.1x faster on CPU inference.
Wide models require less memory and are more interpretable.
Abstract
The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach that is building wider attention Transformers. We demonstrate that wide single layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. The impact of changing the model aspect ratio on Transformers is then studied systematically. This ratio balances the number of layers and the number of attention heads per layer while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single layer wide models perform 0.3% better than their deep counterparts. We show an in-depth evaluation and…
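The aspect ratio trade-off described in the abstract can be illustrated with a minimal sketch: for a fixed total number of attention heads, list every way to split them into layers and heads per layer. The function name `aspect_ratio_configs` and the total of 48 heads are hypothetical choices for illustration only and are not taken from the paper.

```python
def aspect_ratio_configs(total_heads):
    """Return every (layers, heads_per_layer) pair whose product is total_heads.

    Illustrative helper only; the name and signature are assumptions,
    not part of the paper's code.
    """
    configs = []
    for layers in range(1, total_heads + 1):
        # Keep only splits where the heads divide evenly across layers.
        if total_heads % layers == 0:
            configs.append((layers, total_heads // layers))
    return configs


# Hypothetical budget of 48 attention heads (an assumed value).
for layers, heads_per_layer in aspect_ratio_configs(48):
    print(f"{layers:2d} layer(s) x {heads_per_layer:2d} head(s) per layer")
```

The first entry, one layer holding all heads, corresponds to the widest configuration; the last entry, one head per layer, is the deepest. Every entry in between keeps the same total head count.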
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization
