Coarse-to-Fine Reasoning for Visual Question Answering
Binh X. Nguyen, Tuong Do, Huy Tran, Erman Tjiputra, Quang D. Tran, Anh Nguyen

TL;DR
This paper introduces a coarse-to-fine reasoning framework for VQA that effectively integrates features and predicates at multiple semantic levels, improving accuracy and interpretability.
Contribution
It proposes a novel joint learning framework that bridges the semantic gap in VQA by combining features and predicates in a coarse-to-fine manner.
Findings
Achieves superior accuracy on three large-scale VQA datasets.
Provides an explainable reasoning process for VQA predictions.
Outperforms existing state-of-the-art methods.
Abstract
Bridging the semantic gap between image and question is an important step toward improving the accuracy of the Visual Question Answering (VQA) task. However, most existing VQA methods focus on attention mechanisms or visual relations to reason about the answer, while features at different semantic levels are not fully utilized. In this paper, we present a new reasoning framework to fill the gap between visual features and semantic clues in the VQA task. Our method first extracts features and predicates from the image and the question. We then propose a new reasoning framework to jointly learn these features and predicates in a coarse-to-fine manner. Extensive experimental results on three large-scale VQA datasets show that our proposed approach achieves superior accuracy compared with other state-of-the-art methods. Furthermore, our reasoning framework also…
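The abstract describes two stages: extracting features and predicates, then combining them in a coarse-to-fine manner. The sketch below is only a toy illustration of that general idea, not the authors' architecture. It assumes a question-guided attention pass over image-region features (the coarse step), followed by a second attention pass over predicate vectors using a query refined by the coarse summary (the fine step). All function names, the additive fusion, and the tiny 2-D vectors are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, items):
    """Return the attention-weighted average of `items` and the weights used."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[i] for w, item in zip(weights, items)) for i in range(dim)]
    return pooled, weights


def coarse_to_fine(question_vec, region_feats, predicate_vecs):
    """Hypothetical two-stage fusion (an assumption, not the paper's model)."""
    # Coarse step: summarize image-region features under question guidance.
    coarse, region_w = attend(question_vec, region_feats)
    # Fine step: refine the query with the coarse summary, then attend over predicates.
    refined_query = [q + c for q, c in zip(question_vec, coarse)]
    fine, pred_w = attend(refined_query, predicate_vecs)
    # Joint representation: simple additive fusion of the two levels.
    joint = [c + f for c, f in zip(coarse, fine)]
    return joint, region_w, pred_w


if __name__ == "__main__":
    question = [1.0, 0.0]
    regions = [[1.0, 0.0], [0.0, 1.0]]
    predicates = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
    joint, region_w, pred_w = coarse_to_fine(question, regions, predicates)
    print("joint:", joint)
    print("region weights:", region_w)
    print("predicate weights:", pred_w)
```

In a sketch like this, the returned attention weights can be inspected to see which regions and predicates influenced the joint vector. A real system would replace the hand-written vectors with learned embeddings and feed the joint vector to an answer classifier.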
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
