Spatial Attention as an Interface for Image Captioning Models

Philipp Sadler

arXiv:2010.11701·cs.CV·October 23, 2020

TL;DR

This paper explores translating spatial attention mechanisms in image captioning models into natural language to better understand their internal visual functions, demonstrating that attention manipulations influence generated captions and can reveal model focus.

Contribution

It introduces methods that use spatial attention as an interface to image captioning models, translating attention into natural language, linking attention to model behavior, and extending the approach to visual question answering networks.

Findings

01

Depending on the interface method, the captioning model reacts to attention modifications in up to 52.65% of cases.

02

In 9.00% of cases, captions generated under manipulated attention mention object categories that were otherwise unmentioned.

03

Captions reflect question-answer details in 55.20% of cases at the word level.

Abstract

The internal workings of modern deep learning models often remain unclear to an external observer, even though spatial attention mechanisms are involved. The idea of this work is to translate these spatial attentions into natural language to provide simpler access to the model's function. Thus, I took a neural image captioning model and measured its reactions to external modifications of its spatial attention for three different interface methods: a fixation over the whole generation process, a fixation for the first time-steps, and an addition to the generator's attention. The experimental results for bounding-box-based spatial attention vectors show that the captioning model reacts to the changes, depending on the method, in up to 52.65% of cases and, in 9.00% of cases, includes object categories that were otherwise unmentioned. Afterwards, I established such a link to a hierarchical co-attention…
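The three interface methods named in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the grid size, the uniform weighting inside the bounding box, the cut-off `k` for the "first time-steps" variant, and the renormalization after addition are all assumptions made for illustration, and the names `box_attention` and `apply_interface` are hypothetical.

```python
import numpy as np

def box_attention(box, grid=(14, 14)):
    """Turn a bounding box (x0, y0, x1, y1), given in [0, 1] image
    coordinates, into a flat attention vector over a grid of cells.
    Assumption: weight is spread uniformly over the covered cells."""
    h, w = grid
    x0, y0, x1, y1 = box
    # Clamp start indices so a box touching the border still covers one cell.
    r0 = min(int(np.floor(y0 * h)), h - 1)
    c0 = min(int(np.floor(x0 * w)), w - 1)
    r1 = max(int(np.ceil(y1 * h)), r0 + 1)
    c1 = max(int(np.ceil(x1 * w)), c0 + 1)
    mask = np.zeros((h, w))
    mask[r0:r1, c0:c1] = 1.0
    return (mask / mask.sum()).ravel()

def apply_interface(model_attn, external_attn, step, method, k=3):
    """Return the attention the caption generator would use at one time-step.

    method == "fixed":       external attention replaces the model's at every step
    method == "first_steps": external attention replaces it only for step < k
                             (k is a hypothetical cut-off)
    method == "additive":    external attention is added to the model's own,
                             then renormalized (renormalizing is an assumption)
    """
    if method == "fixed":
        return external_attn
    if method == "first_steps":
        return external_attn if step < k else model_attn
    if method == "additive":
        mixed = model_attn + external_attn
        return mixed / mixed.sum()
    raise ValueError(f"unknown method: {method}")

if __name__ == "__main__":
    grid = (4, 4)
    external = box_attention((0.0, 0.0, 0.5, 0.5), grid)   # top-left quadrant
    model = np.full(grid[0] * grid[1], 1.0 / 16)            # stand-in for the model's own attention
    for method in ("fixed", "first_steps", "additive"):
        print(method, apply_interface(model, external, step=0, method=method).round(3))
```

In an actual captioning model, the returned vector would weight the image region features before each word is generated; comparing captions with and without the external attention is what would yield reaction rates of the kind reported above.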

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection