TL;DR
Graphhopper introduces a multi-hop scene graph reasoning approach for visual question answering, combining knowledge graph navigation with reinforcement learning to improve reasoning accuracy on complex image questions.
Contribution
It presents a novel multi-hop reasoning method for VQA that uses reinforcement learning over scene graphs, with performance competitive with state-of-the-art models.
Findings
Performs on par with human accuracy on manually curated scene graphs.
Outperforms existing scene graph reasoning models on the GQA dataset.
Effective on both manually curated and automatically generated scene graphs.
Abstract
Visual Question Answering (VQA) is concerned with answering free-form questions about an image. Since it requires a deep semantic and linguistic understanding of the question and the ability to associate it with various objects that are present in the image, it is an ambitious task and requires multi-modal reasoning from both computer vision and natural language processing. We propose Graphhopper, a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Concretely, our method is based on performing context-driven, sequential reasoning based on the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships. Subsequently, a reinforcement learning agent is trained…
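To make the scene graph idea concrete, here is a minimal sketch of a toy scene graph and a two-hop walk over it. It is an illustration only, not the authors' implementation: the graph contents, the relation names (`holding`, `has_color`, `next_to`), the choice to model attributes as edges, and the `follow_path` helper are all assumptions made for this example. In particular, the relation sequence is written by hand here, whereas the paper trains a reinforcement learning agent to choose its steps.

```python
from typing import Dict, List, Tuple

# Toy scene graph: each node maps to its outgoing (relation, neighbor) edges.
# Modeling attributes such as color as edges is an assumption of this sketch.
SceneGraph = Dict[str, List[Tuple[str, str]]]

scene_graph: SceneGraph = {
    "woman": [("holding", "umbrella"), ("next_to", "dog")],
    "umbrella": [("has_color", "red")],
    "dog": [("has_color", "brown")],
    "red": [],
    "brown": [],
}


def follow_path(graph: SceneGraph, start: str, relations: List[str]) -> List[str]:
    """Walk from `start`, taking one edge per relation; return the visited nodes.

    The walk stops early if the current node has no edge with the requested
    relation. If several edges match, the first one is taken.
    """
    path = [start]
    current = start
    for relation in relations:
        targets = [dst for rel, dst in graph.get(current, []) if rel == relation]
        if not targets:
            break
        current = targets[0]
        path.append(current)
    return path


# Hypothetical question: "What color is the umbrella the woman is holding?"
# The hand-written relation sequence stands in for a learned policy.
path = follow_path(scene_graph, "woman", ["holding", "has_color"])
answer = path[-1]
```

In this sketch the answer is read off as the last node on the path, so a question that needs two relations is resolved in two hops. A learned agent would instead score the available edges at each step given the question, which is the part this example deliberately leaves out.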
