Normalized and Geometry-Aware Self-Attention Network for Image Captioning

Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, and Hanqing Lu

arXiv:2003.08897 · cs.CV · March 20, 2020 · 26 citations

TL;DR

This paper introduces a normalized and geometry-aware self-attention mechanism for image captioning, improving performance by integrating normalization inside self-attention and explicitly modeling object geometry relations.

Contribution

The paper proposes Normalized Self-Attention (NSA), which applies normalization to the hidden activations inside the attention module, and Geometry-aware Self-Attention (GSA), which incorporates the relative spatial relations between objects, to improve image captioning models.
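As an illustration of what "spatial relations" between detected objects could look like numerically, here is a minimal sketch that turns pairs of bounding boxes into relative geometry features. The log-ratio encoding, the `(cx, cy, w, h)` box format, and the function names `relative_geometry` and `pairwise_geometry` are assumptions made for this example (the encoding is a common choice in object-relation models); the summary above does not state which features the paper uses.

```python
import math

def relative_geometry(box_i, box_j, eps=1e-3):
    """Return a 4-d relative geometry feature for an ordered pair of boxes.

    Each box is (cx, cy, w, h): center coordinates, width, height.
    The log-ratio encoding below is an assumed, commonly used choice;
    it is not confirmed as the paper's exact formulation.
    """
    cxi, cyi, wi, hi = box_i
    cxj, cyj, wj, hj = box_j
    return [
        math.log(max(abs(cxi - cxj), eps) / wi),  # horizontal offset, scaled by width of box i
        math.log(max(abs(cyi - cyj), eps) / hi),  # vertical offset, scaled by height of box i
        math.log(wi / wj),                        # width ratio
        math.log(hi / hj),                        # height ratio
    ]

def pairwise_geometry(boxes):
    """Build an N x N grid of relative geometry features for N boxes."""
    return [[relative_geometry(bi, bj) for bj in boxes] for bi in boxes]

# Two hypothetical boxes: same height center, 4 units apart horizontally.
boxes = [(10.0, 10.0, 4.0, 4.0), (14.0, 10.0, 4.0, 2.0)]
features = pairwise_geometry(boxes)
```

Features like these depend only on box positions and sizes, so they can be computed once per image and reused by every attention layer.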

Findings

01. Achieved superior results on the MS-COCO dataset.

02. Demonstrated generality across video captioning, machine translation, and VQA.

03. Improved modeling of object geometry relations.

Abstract

Self-attention (SA) networks have shown profound value in image captioning. In this paper, we improve SA in two respects to boost image captioning performance. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization has previously been applied only outside SA, we introduce a novel normalization method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for a major limitation of the Transformer, namely that it fails to model the geometric structure of the input objects, we propose a class of Geometry-aware Self-Attention (GSA) that extends SA to explicitly and efficiently consider the relative geometry relations between the objects in the image. To construct our image captioning model, we combine the two modules and apply them to the…
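To make the two ideas in the abstract concrete, the following is a rough sketch of scaled dot-product attention with (a) normalization applied to the queries inside the attention computation and (b) an additive geometry term on the attention scores. Several details are assumptions for illustration only: normalizing the queries per channel across the input elements, using a single scalar bias per object pair, and the names `normalize_columns`, `attention`, and `geometry_bias`. The abstract does not specify which activations are normalized or how geometry enters the scores.

```python
import math

def normalize_columns(x, eps=1e-5):
    """Standardize each feature channel across the N input elements (assumed scheme)."""
    n, d = len(x), len(x[0])
    out = [[0.0] * d for _ in range(n)]
    for c in range(d):
        col = [row[c] for row in x]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for r in range(n):
            out[r][c] = (x[r][c] - mean) / math.sqrt(var + eps)
    return out

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, geometry_bias=None, normalize=True):
    """Scaled dot-product attention with optional query normalization
    and an optional additive N x N geometry bias (both hypothetical forms)."""
    if normalize:
        q = normalize_columns(q)  # normalization inside the attention module
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        if geometry_bias is not None:
            # Add the pairwise geometry term before the softmax.
            scores = [s + g for s, g in zip(scores, geometry_bias[i])]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v)))
                        for c in range(len(v[0]))])
    return outputs, weights

# Toy inputs: three "objects" with 2-d features.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [[0.0, 0.0, -1e9], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
out, w = attention(q, k, v, geometry_bias=bias)
```

In this toy call, the large negative bias for the pair (0, 2) effectively removes object 2 from object 0's attention, which shows how a geometry term can reshape the weights independently of content similarity.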

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax