Interpretable Counting for Visual Question Answering

Alexander Trott; Caiming Xiong; Richard Socher

arXiv:1712.08697 · cs.AI · March 5, 2018 · 20 cites

TL;DR

This paper introduces an interpretable, sequential decision-based model for counting objects in images to improve visual question answering, providing more accurate and grounded answers.

Contribution

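The paper presents a counting approach that treats counting as a sequential decision process over detected objects. Discrete counts are grounded in the image, which makes the output interpretable, and the method outperforms the state-of-the-art VQA architecture on multiple counting metrics.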

Findings

01. Outperforms the state-of-the-art VQA architecture on multiple counting metrics

02. Provides interpretable, grounded counting outputs

03. Effective in complex visual question answering scenarios

Abstract

Questions that require counting a variety of objects in images remain a major challenge in visual question answering (VQA). The most common approaches to VQA involve either classifying answers based on fixed-length representations of both the image and question or summing fractional counts estimated from each section of the image. In contrast, we treat counting as a sequential decision process and force our model to make discrete choices of what to count. Specifically, the model sequentially selects from detected objects and learns interactions between objects that influence subsequent selections. A distinction of our approach is its intuitive and interpretable output, as discrete counts are automatically grounded in the image. Furthermore, our method outperforms the state-of-the-art architecture for VQA on multiple metrics that evaluate counting.
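
The sequential selection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' model: the learned, question-conditioned components are replaced by fixed numbers, and the names `sequential_count`, `scores`, `interactions`, and `stop_score` are hypothetical, introduced only for this illustration. It assumes each detected object has a relevance score, that selecting an object adjusts the scores of the remaining objects through a pairwise interaction table, and that counting ends when no remaining object scores above a stop threshold.

```python
from typing import List, Sequence, Tuple


def sequential_count(
    scores: Sequence[float],
    interactions: Sequence[Sequence[float]],
    stop_score: float = 0.0,
) -> Tuple[int, List[int]]:
    """Greedy sequential counting over detected objects (illustrative only).

    scores[i]          -- assumed relevance of detected object i to the question
    interactions[i][j] -- assumed adjustment to object j's score after i is selected
    stop_score         -- assumed fixed score of the "stop counting" action
    """
    current = list(scores)
    remaining = list(range(len(current)))
    selected: List[int] = []

    while remaining:
        # Discrete choice: pick the single best remaining object.
        best = max(remaining, key=lambda i: current[i])
        if current[best] <= stop_score:
            break  # stopping is preferred over every remaining object
        selected.append(best)
        remaining.remove(best)
        # The selection influences later choices, e.g. it can suppress
        # a duplicate detection of the object that was just counted.
        for j in remaining:
            current[j] += interactions[best][j]

    # The count is the number of selections; the indices ground it in the image.
    return len(selected), selected


# Four hypothetical detections; objects 0 and 1 are overlapping duplicates.
scores = [2.0, 1.8, 0.5, -1.0]
interactions = [
    [0.0, -3.0, 0.0, 0.0],
    [-3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
count, selected = sequential_count(scores, interactions)
print(count, selected)
```

In this toy input, selecting object 0 suppresses its duplicate (object 1), so the returned count is 2 with objects 0 and 2 selected. The list of selected indices is what makes the count interpretable: each counted item maps back to a specific detection.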

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques