Towards Understanding the Effectiveness of Attention Mechanism
Xiang Ye, Zihang He, Heng Wang, Yong Li

TL;DR
This paper investigates the true source of attention mechanisms' effectiveness in CNNs, revealing the crucial role of feature map multiplication in regularization and performance enhancement, beyond traditional visual attention explanations.
Contribution
It uncovers the importance of feature map multiplication in attention mechanisms and introduces FMMNet, a network that replaces the feature map addition in ResNet with feature map multiplication and outperforms ResNet.
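As a rough illustration of what replacing addition with multiplication changes, the toy sketch below combines an input with a branch output in both ways. It is not the paper's FMMNet code: the `branch` function, its weight `w`, and the helper names are hypothetical stand-ins, and a real block would use convolutions and normalization rather than an element-wise linear map.

```python
# Toy illustration (assumed setup, not the paper's implementation):
# compare combining a block input x with a branch output f(x)
# by addition versus element-wise multiplication.

def branch(x, w=0.5):
    """Hypothetical stand-in for a residual branch: element-wise linear map."""
    return [w * v for v in x]

def combine_add(x):
    """ResNet-style combination: x + f(x)."""
    return [a + b for a, b in zip(x, branch(x))]

def combine_mul(x):
    """Multiplicative combination: x * f(x), element-wise."""
    return [a * b for a, b in zip(x, branch(x))]

x = [1.0, 2.0, 4.0]
added = combine_add(x)       # stays linear in x: 1.5 * x
multiplied = combine_mul(x)  # becomes quadratic in x: 0.5 * x**2
print(added)       # [1.5, 3.0, 6.0]
print(multiplied)  # [0.5, 2.0, 8.0]
```

With a linear branch, the additive form remains linear in the input, while the multiplicative form is second order in the input. This is one simple way to see the higher-order non-linearity that the abstract attributes to feature map multiplication.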
Findings
Feature map multiplication contributes to smoother, more stable learned landscapes.
FMMNet outperforms ResNet on various datasets.
Attention weights correlate only weakly with feature importance.
Abstract
The attention mechanism is a widely used method for improving the performance of convolutional neural networks (CNNs) on computer vision tasks. Despite its pervasiveness, we have a poor understanding of where its effectiveness comes from. It is popularly believed that its effectiveness stems from the visual attention explanation, which advocates focusing on the important parts of the input data rather than ingesting the entire input. In this paper, we find that there is only a weak consistency between the attention weights of features and their importance. Instead, we verify the crucial role of feature map multiplication in the attention mechanism and uncover a fundamental impact of feature map multiplication on the learned landscapes of CNNs: through the high-order non-linearity it introduces, feature map multiplication plays a regularization role on CNNs, which makes them learn smoother and more stable…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: 1x1 Convolution · Convolution · Batch Normalization · Residual Connection · Average Pooling · Global Average Pooling · Bottleneck Residual Block · Kaiming Initialization · Residual Block
