Areas of Attention for Image Captioning
Marco Pedersoli, Thomas Lucas, Cordelia Schmid, Jakob Verbeek

TL;DR
This paper introduces an attention-based image captioning model that captures the dependencies among image regions, caption words, and RNN language-model states through three pairwise interactions, which improves localization and yields state-of-the-art results on MSCOCO.
Contribution
The paper proposes an attention mechanism that directly associates caption words with image regions, and it compares three ways of generating attention areas: CNN activation grids, object proposals, and spatial transformers.
Findings
Spatial transformers give better results than CNN activation grids and object proposals for generating attention areas.
The model achieves state-of-the-art results on MSCOCO.
Associating caption words with image regions improves caption accuracy.
Abstract
We propose "Areas of Attention", a novel attention-based model for automatic image captioning. Our approach models the dependencies between image regions, caption words, and the state of an RNN language model, using three pairwise interactions. In contrast to previous attention-based approaches that associate image regions only to the RNN state, our method allows a direct association between caption words and image regions. During training these associations are inferred from image-level captions, akin to weakly-supervised object detector training. These associations help to improve captioning by localizing the corresponding regions during testing. We also propose and compare different ways of generating attention areas: CNN activation grids, object proposals, and spatial transformers nets applied in a convolutional fashion. Spatial transformers give the best results. They allow for…
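The three pairwise interactions described in the abstract can be illustrated with a small sketch. The abstract does not give the form of the interactions, so this example assumes each one is a bilinear score and that the scores are combined by a softmax over all word-region pairs. The names `joint_word_region`, `W_wh`, `W_wr`, and `W_rh`, along with the dimensions and random inputs, are hypothetical and chosen only for illustration.

```python
import math
import random


def bilinear(x, W, y):
    """Return the bilinear score x^T W y for plain Python lists."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def joint_word_region(words, regions, h, W_wh, W_wr, W_rh):
    """Joint distribution over (word, region) pairs given the RNN state h.

    Assumption: the score of a pair is the sum of three bilinear terms
    (word-state, word-region, region-state), normalized with a softmax.
    """
    scores = [[bilinear(w, W_wh, h) + bilinear(w, W_wr, r) + bilinear(r, W_rh, h)
               for r in regions] for w in words]
    top = max(max(row) for row in scores)  # subtract the max for stability
    exps = [[math.exp(s - top) for s in row] for row in scores]
    z = sum(sum(row) for row in exps)
    return [[e / z for e in row] for row in exps]


def marginals(joint):
    """Sum out regions to get word probabilities, and words to get attention."""
    p_word = [sum(row) for row in joint]
    p_region = [sum(col) for col in zip(*joint)]
    return p_word, p_region


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical): 6 vocabulary words, 2 image regions.
rng = random.Random(0)
d_w, d_r, d_h = 3, 4, 5
words = random_matrix(6, d_w, rng)    # word embeddings
regions = random_matrix(2, d_r, rng)  # region features
h = [rng.uniform(-1.0, 1.0) for _ in range(d_h)]  # RNN state
W_wh = random_matrix(d_w, d_h, rng)
W_wr = random_matrix(d_w, d_r, rng)
W_rh = random_matrix(d_r, d_h, rng)

joint = joint_word_region(words, regions, h, W_wh, W_wr, W_rh)
p_word, p_region = marginals(joint)
```

Under these assumptions, `p_word` plays the role of the next-word distribution and `p_region` the attention over regions. The word-region term is what lets a word score depend directly on a region rather than only on the RNN state.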
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Spatial Transformer
