# Attention Augmented Convolutional Networks

**Authors:** Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, Quoc V. Le

arXiv: 1904.09925 · 2020-09-11

## TL;DR

This paper introduces a two-dimensional relative self-attention mechanism for images. Concatenating its output with convolutional feature maps improves ImageNet classification and COCO object detection while keeping the number of parameters similar.

## Contribution

The paper proposes a two-dimensional relative self-attention mechanism and shows that augmenting convolutional networks with it improves both image classification and object detection.
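
As a rough sketch of what "two-dimensional relative" can mean here: the attention logit between a query pixel $i$ and a key pixel $j$ is the usual scaled dot product plus terms that depend only on the relative column and row offsets between the two pixels. The notation below ($q_i$, $k_j$, $d_k$, $r^W$, $r^H$, and the pixel coordinates $(i_x, i_y)$) is assumed for illustration; the complete paper gives the authoritative definition.

$$
l_{i,j} = \frac{q_i^{\top}}{\sqrt{d_k}} \left( k_j + r^{W}_{j_x - i_x} + r^{H}_{j_y - i_y} \right)
$$

Here $r^{W}$ and $r^{H}$ stand for learned embeddings indexed by relative width and height offsets, so shifting both pixels by the same amount leaves the positional terms unchanged.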

## Key findings

- Improves ImageNet top-1 accuracy by 1.3% over a ResNet50 baseline.
- Outperforms other attention mechanisms for images, such as Squeeze-and-Excitation.
- Improves COCO object detection by 1.4 mAP over a RetinaNet baseline.

## Abstract

Convolutional networks have been the paradigm of choice in many computer vision applications. The convolution operation however has a significant weakness in that it only operates on a local neighborhood, thus missing global information. Self-attention, on the other hand, has emerged as a recent advance to capture long range interactions, but has mostly been applied to sequence modeling and generative modeling tasks. In this paper, we consider the use of self-attention for discriminative visual tasks as an alternative to convolutions. We introduce a novel two-dimensional relative self-attention mechanism that proves competitive in replacing convolutions as a stand-alone computational primitive for image classification. We find in control experiments that the best results are obtained when combining both convolutions and self-attention. We therefore propose to augment convolutional operators with this self-attention mechanism by concatenating convolutional feature maps with a set of feature maps produced via self-attention. Extensive experiments show that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile constrained network, while keeping the number of parameters similar. In particular, our method achieves a $1.3\%$ top-1 accuracy improvement on ImageNet classification over a ResNet50 baseline and outperforms other attention mechanisms for images such as Squeeze-and-Excitation. It also achieves an improvement of 1.4 mAP in COCO Object Detection on top of a RetinaNet baseline.
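
The concatenation step described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it uses a single attention head, omits the relative position terms, and takes the convolutional output as precomputed stand-in values. The function names (`self_attention`, `augment`) and all weights are hypothetical, chosen only for this example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, mat):
    # Multiply a length-n vector by an n x m matrix (list of rows).
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def self_attention(feats, w_q, w_k, w_v):
    # feats: H*W position vectors, i.e. the image flattened over space.
    # Every position attends to every other position (global context).
    qs = [matvec(f, w_q) for f in feats]
    ks = [matvec(f, w_k) for f in feats]
    vs = [matvec(f, w_v) for f in feats]
    d_k = len(qs[0])
    out = []
    for q in qs:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in ks]
        weights = softmax(logits)
        out.append([sum(w * v[c] for w, v in zip(weights, vs))
                    for c in range(len(vs[0]))])
    return out

def augment(conv_feats, attn_feats):
    # Concatenate along the channel axis at each spatial position.
    return [c + a for c, a in zip(conv_feats, attn_feats)]

# Hypothetical 2x2 image with 2 input channels, flattened to 4 positions.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Stand-in for a convolution output with 3 channels per position.
conv_feats = [[0.1, 0.2, 0.3] for _ in feats]
# Illustrative projection weights: 2 input channels -> 1 channel each.
w_q = [[0.5], [-0.5]]
w_k = [[0.3], [0.7]]
w_v = [[1.0], [1.0]]

augmented = augment(conv_feats, self_attention(feats, w_q, w_k, w_v))
print(len(augmented), len(augmented[0]))  # 4 positions, 3 + 1 = 4 channels
```

In this sketch the output channel budget is split between the two branches (3 convolutional channels and 1 attention channel), which is one way to read the abstract's point that the parameter count stays similar.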

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.09925/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/1904.09925/full.md

## References

55 references — full list in the complete paper: https://tomesphere.com/paper/1904.09925/full.md

---
Source: https://tomesphere.com/paper/1904.09925