Spatial Attention as an Interface for Image Captioning Models
Philipp Sadler

TL;DR
This paper explores translating spatial attention mechanisms in image captioning models into natural language to better understand their internal visual functions, demonstrating that attention manipulations influence generated captions and can reveal model focus.
Contribution
It introduces methods to interpret spatial attention in image captioning models as natural language interfaces, linking attention to model behavior and extending to visual question answering networks.
Findings
The captioning model reacts to attention modifications with method-dependent changes of up to 52.65% in caption content.
In 9.00% of cases, captions generated under manipulated attention mention object categories that were otherwise unmentioned.
Captions reflect question-answer details in 55.20% of cases at the word level.
Abstract
The internal workings of modern deep learning models often remain unclear to an external observer, even when spatial attention mechanisms are involved. The idea of this work is to translate these spatial attentions into natural language to provide simpler access to the model's function. To this end, I took a neural image captioning model and measured its reactions to external modifications of its spatial attention for three different interface methods: a fixation over the whole generation process, a fixation for the first time-steps, and an addition to the generator's attention. The experimental results for bounding-box-based spatial attention vectors show that the captioning model reacts with method-dependent changes in up to 52.65% and includes, in 9.00% of the cases, object categories that were otherwise unmentioned. Afterwards, I established such a link to a hierarchical co-attention…
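The three interface methods named in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names `box_to_attention` and `apply_interface`, the grid size, the `first_steps` cutoff, and the renormalization after addition are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def box_to_attention(box, grid=(14, 14)):
    """Turn a normalized bounding box (x0, y0, x1, y1 in [0, 1]) into a
    flattened spatial attention vector that sums to 1.
    Hypothetical helper; the grid size is an assumption."""
    h, w = grid
    x0, y0, x1, y1 = box
    # Centers of the grid cells in normalized image coordinates.
    ys = (np.arange(h) + 0.5) / h
    xs = (np.arange(w) + 0.5) / w
    rows = (ys >= y0) & (ys <= y1)
    cols = (xs >= x0) & (xs <= x1)
    mask = (rows[:, None] & cols[None, :]).astype(float)
    if mask.sum() == 0:
        raise ValueError("bounding box covers no grid cell center")
    # Uniform attention over the cells whose centers lie inside the box.
    return (mask / mask.sum()).ravel()

def apply_interface(model_attn, external_attn, step, method, first_steps=3):
    """Combine the generator's own attention with an external one.
    `first_steps` and the renormalization are assumptions."""
    if method == "fixation":
        # External attention replaces the model's at every time-step.
        return external_attn
    if method == "initial_fixation":
        # External attention only during the first few time-steps.
        return external_attn if step < first_steps else model_attn
    if method == "addition":
        # External attention is added to the model's, then renormalized.
        mixed = model_attn + external_attn
        return mixed / mixed.sum()
    raise ValueError(f"unknown method: {method}")

# Usage: fix attention on the top-left quadrant of a 4x4 feature grid.
external = box_to_attention((0.0, 0.0, 0.5, 0.5), grid=(4, 4))
uniform = np.full(16, 1.0 / 16)  # stand-in for the model's own attention
mixed = apply_interface(uniform, external, step=0, method="addition")
```

In a real captioning model, the returned vector would weight the image region features at each decoding step; how the original work injects it into the decoder is not described in the text above.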
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
