Normalized and Geometry-Aware Self-Attention Network for Image Captioning
Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, and Hanqing Lu

TL;DR
This paper introduces a normalized and geometry-aware self-attention mechanism for image captioning, improving performance by integrating normalization inside self-attention and explicitly modeling object geometry relations.
Contribution
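The paper proposes Normalized Self-Attention (NSA), which applies normalization to the hidden activations inside self-attention, and Geometry-aware Self-Attention (GSA), which incorporates the relative geometry relations between objects in the image. Combining the two improves image captioning models.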
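To make the two ideas concrete, the following is a minimal single-head NumPy sketch, not the authors' implementation. It rests on several assumptions that this summary does not state: normalization is applied to the queries by standardizing each channel across the N object positions with no learned scale or shift, the relative geometry between boxes is encoded as log-ratios of center offsets and sizes, and geometry enters the attention as a content-independent ReLU bias added to the logits. All function and parameter names (`normalize_queries`, `relative_geometry`, `attention_sketch`, `w_geo`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def normalize_queries(q, eps=1e-5):
    # Assumption: standardize each query channel across the N positions
    # (instance-norm style), without learned scale or shift.
    mean = q.mean(axis=0, keepdims=True)
    var = q.var(axis=0, keepdims=True)
    return (q - mean) / np.sqrt(var + eps)


def relative_geometry(boxes, eps=1e-3):
    # boxes: (N, 4) array of (cx, cy, w, h) with positive w and h.
    # Assumption: relative geometry is a 4-d log-ratio feature per box pair.
    cx, cy, w, h = boxes.T
    dx = np.log(np.maximum(np.abs(cx[:, None] - cx[None, :]) / w[:, None], eps))
    dy = np.log(np.maximum(np.abs(cy[:, None] - cy[None, :]) / h[:, None], eps))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)  # shape (N, N, 4)


def attention_sketch(x, boxes, wq, wk, wv, w_geo):
    # x: (N, d_in) object features; wq, wk, wv: (d_in, d) projections.
    q = normalize_queries(x @ wq)   # normalization inside the attention
    k = x @ wk
    v = x @ wv
    content = q @ k.T / np.sqrt(q.shape[-1])
    # Assumption: content-independent geometry bias, ReLU(G . w_geo).
    geo_bias = np.maximum(relative_geometry(boxes) @ w_geo, 0.0)  # (N, N)
    weights = softmax(content + geo_bias, axis=-1)
    return weights @ v, weights
```

Calling `attention_sketch` with N object features and their N bounding boxes returns the attended features and an N x N weight matrix whose rows sum to 1. The geometry term shifts the logits before the softmax, so the layout of the boxes can change which objects attend to each other even when their content features are identical.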
Findings
Achieved superior results on the MS-COCO dataset.
Demonstrated generality across video captioning, machine translation, and VQA.
Improved modeling of object geometry relations.
Abstract
Self-attention (SA) network has shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization is previously only applied outside SA, we introduce a novel normalization method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for the major limit of Transformer that it fails to model the geometry structure of the input objects, we propose a class of Geometry-aware Self-Attention (GSA) that extends SA to explicitly and efficiently consider the relative geometry relations between the objects in the image. To construct our image captioning model, we combine the two modules and apply it to the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
