Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson; Xiaodong He; Chris Buehler; Damien Teney; Mark Johnson; Stephen Gould; Lei Zhang

arXiv:1707.07998 · cs.CV · March 15, 2018 · 94 cites

TL;DR

This paper introduces a combined bottom-up and top-down attention mechanism for image captioning and VQA, focusing attention on salient regions and objects to improve understanding and performance.

Contribution

The paper presents a novel attention approach that integrates bottom-up object proposals with top-down weighting, advancing image captioning and VQA performance.

Findings

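01. Achieved state-of-the-art results on the MSCOCO test server for image captioning, with CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively.

02. Secured first place in the 2017 VQA Challenge.

03. Demonstrated the broad applicability of the attention mechanism by applying it to both image captioning and VQA.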

Abstract

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability…
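
The division of labor described in the abstract (region proposals with feature vectors from the bottom-up side, feature weightings from the top-down side) can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the softmax normalization, the `attend` function, and the toy feature values are all assumptions chosen for brevity. In the paper the regions come from Faster R-CNN and the weighting is produced by a learned model.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Weight bottom-up region features using a top-down query vector.

    region_features: list of per-region feature vectors (hypothetical
        stand-ins for the vectors a detector would output).
    query: task context vector (hypothetical, e.g. a question or
        caption-state encoding).
    Scoring by dot product is an assumption made for this sketch.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    # Attended feature: weighted sum of the region vectors.
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Toy example: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, attended = attend(regions, query)
```

Here the first and third regions align with the query and receive equal, larger weights than the second region, so the attended vector is dominated by their features.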

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning