An Empirical Study of Spatial Attention Mechanisms in Deep Networks
Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, Jifeng Dai

TL;DR
This paper empirically investigates how various spatial attention components influence deep neural network performance across different models and tasks, revealing insights that challenge conventional beliefs and suggest avenues for improvement.
Contribution
It systematically ablates and compares spatial attention elements within a single generalized attention formulation that covers Transformer attention, deformable convolution, and dynamic convolution, clarifying the role each element plays.
Findings
The query-key content comparison in Transformer attention contributes negligibly in self-attention but is vital in encoder-decoder attention.
Properly combining deformable convolution with key-content-only saliency gives the best accuracy-efficiency tradeoff in self-attention.
There is significant potential for enhancing attention mechanism designs.
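The distinction between a query-key comparison and key-only saliency can be illustrated with a minimal sketch. This is not the authors' implementation: the function `attention_weights`, the vector `saliency`, and the two switches are hypothetical names chosen for illustration, and the sketch leaves out learned projections, multiple heads, and the position-dependent terms that the paper also studies. It only shows how an attention logit can be assembled from separately ablatable terms.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(
    query: Vector,
    keys: Sequence[Vector],
    saliency: Vector,
    use_query_key: bool = True,
    use_key_only: bool = True,
) -> List[float]:
    """Attention weights of one query over all keys.

    Each logit is a sum of terms that can be switched off (ablated):
      - query-key term: compares query content with key content
      - key-only term:  scores key content alone via a shared vector
        (`saliency` is an assumed stand-in for a learned parameter)
    """
    logits = []
    for key in keys:
        score = 0.0
        if use_query_key:
            score += dot(query, key)
        if use_key_only:
            score += dot(saliency, key)
        logits.append(score)
    return softmax(logits)


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    saliency = [0.5, -0.5]
    # With only the key-only term, every query sees the same weights.
    print(attention_weights([1.0, 2.0], keys, saliency, use_query_key=False))
    print(attention_weights([-3.0, 0.5], keys, saliency, use_query_key=False))
```

With the query-key term switched off, the weights no longer depend on the query, so they could be computed once per key set and shared across all queries. That sharing is one plausible source of the efficiency gain that key-only saliency offers.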
Abstract
Attention mechanisms have become a popular component in deep neural networks, yet there has been little examination of how different influencing factors and methods for computing attention from these factors affect performance. Toward a better general understanding of attention mechanisms, we present an empirical study that ablates various spatial attention elements within a generalized attention formulation, encompassing the dominant Transformer attention as well as the prevalent deformable convolution and dynamic convolution modules. Conducted on a variety of applications, the study yields significant findings about spatial attention in deep networks, some of which run counter to conventional understanding. For example, we find that the query and key content comparison in Transformer attention is negligible for self-attention, but vital for encoder-decoder attention. A proper…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
