Coarse-to-Fine Reasoning for Visual Question Answering

Binh X. Nguyen; Tuong Do; Huy Tran; Erman Tjiputra; Quang D. Tran; Anh Nguyen

arXiv:2110.02526 · cs.CV · April 20, 2022 · 1 citation

Open Access · 2 Repos

TL;DR

This paper introduces a coarse-to-fine reasoning framework for VQA that effectively integrates features and predicates at multiple semantic levels, improving accuracy and interpretability.

Contribution

It proposes a novel joint learning framework that bridges the semantic gap in VQA by combining features and predicates in a coarse-to-fine manner.

Findings

01. Achieves superior accuracy on three large-scale VQA datasets.
02. Provides an explainable reasoning process for VQA predictions.
03. Outperforms existing state-of-the-art methods.

Abstract

Bridging the semantic gap between the image and the question is an important step toward improving the accuracy of Visual Question Answering (VQA). However, most existing VQA methods focus on attention mechanisms or visual relations to reason about the answer, while features at different semantic levels are not fully utilized. In this paper, we present a new reasoning framework to fill the gap between visual features and semantic clues in the VQA task. Our method first extracts features and predicates from the image and question, then jointly learns them in a coarse-to-fine manner. Extensive experimental results on three large-scale VQA datasets show that our approach achieves superior accuracy compared with other state-of-the-art methods. Furthermore, our reasoning framework also…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning