Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering
Medhini Narasimhan, Svetlana Lazebnik, Alexander G. Schwing

TL;DR
This paper introduces a graph convolutional network approach for factual visual question answering, enabling joint reasoning over entities and facts, which improves accuracy on the FVQA dataset.
Contribution
The paper proposes using an entity graph and graph convolutional networks for reasoning in FVQA, surpassing previous methods that consider facts sequentially.
Findings
Achieved an accuracy improvement of approximately 7% over prior state-of-the-art methods on the FVQA dataset.
Demonstrated the effectiveness of joint reasoning over entities in FVQA.
Validated approach on the challenging FVQA dataset.
Abstract
Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction a novel 'fact-based' visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation. Given a question-image pair, deep network techniques have been employed to successively reduce the large set of facts until one of the two entities of the final remaining fact is predicted as the answer. We observe that a successive process which considers one fact at a time to form a local decision is sub-optimal. Instead, we develop an entity graph and use a graph convolutional network to 'reason' about the correct answer by jointly considering…
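To make the idea of joint reasoning over an entity graph concrete, here is a minimal sketch in Python with NumPy. It is not the authors' architecture: the summary above does not give layer sizes, node features, or training details. The sketch assumes the widely used symmetric-normalization form of graph convolution, random weights, and random node features. The names `normalize_adjacency`, `gcn_layer`, and `score_entities` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def normalize_adjacency(adj):
    """Add self-loops and apply symmetric normalization D^-1/2 (A + I) D^-1/2.

    Assumption: this is the common GCN normalization, not necessarily
    the one used in the paper.
    """
    a_hat = adj + np.eye(adj.shape[0])
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, h, w):
    """One graph convolution: mix neighbor features, project, apply ReLU."""
    return np.maximum(a_norm @ h @ w, 0.0)


def score_entities(adj, features, w1, w2):
    """Return one score per entity node after two rounds of message passing."""
    a_norm = normalize_adjacency(adj)
    hidden = gcn_layer(a_norm, features, w1)
    # Final layer is linear and produces a single logit per node.
    return (a_norm @ hidden @ w2).ravel()


# Toy graph: 4 entities; an edge means some fact links the two entities.
adj = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))  # hypothetical per-entity features
w1 = rng.normal(size=(8, 16))       # untrained, random weights
w2 = rng.normal(size=(16, 1))

logits = score_entities(adj, features, w1, w2)
predicted = int(np.argmax(logits))  # index of the entity chosen as the answer
```

Because every node's score depends on its neighbors' features, all candidate entities are scored in a single pass, in contrast to a pipeline that filters facts one at a time. In a real system the features would encode the image, the question, and the retrieved facts, and the weights would be learned.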
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
