Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs

Daniel Reich; Felix Putze; Tanja Schultz

arXiv:2106.14476 · cs.CV · June 29, 2021

TL;DR

This paper introduces 'Adventurer's Treasure Hunt', a modular, transparent system for compositional visual question answering that explicitly models reasoning paths, quantifies component impacts, and dynamically queries visual knowledge bases, achieving high grounding accuracy.

Contribution

The paper presents ATH, a GQA-trained VQA system designed for transparency, visual grounding, and dynamic answer extraction; it reports the highest visual grounding score among the systems examined.

Findings

01

Achieves highest visual grounding score among examined systems.

02

Explicitly quantifies impact of each component on performance.

03

Models reasoning as a visual treasure hunt with interpretable inference paths.

Abstract

With the expressed goal of improving system transparency and visual grounding in the reasoning process in VQA, we present a modular system for the task of compositional VQA based on scene graphs. Our system is called "Adventurer's Treasure Hunt" (or ATH), named after an analogy we draw between our model's search procedure for an answer and an adventurer's search for treasure. We developed ATH with three characteristic features in mind:

1. By design, ATH allows us to explicitly quantify the impact of each of the sub-components on overall VQA performance, as well as their performance on their individual sub-task.
2. By modeling the search task after a treasure hunt, ATH inherently produces an explicit, visually grounded inference path for the processed question.
3. ATH is the first GQA-trained VQA system that dynamically extracts answers by querying the visual knowledge base directly,…
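To make the treasure-hunt analogy concrete, the toy sketch below walks a hand-written scene graph and records every object it visits, so the answer comes with an explicit inference path. It is not ATH's implementation: the `SceneGraph` class, the `hunt` function, the example graph, and the choice to follow only the first matching relation target are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical minimal scene graph (illustration only)."""
    # object id -> {"name": str, "attributes": {category: value}}
    objects: dict
    # (subject id, relation name) -> list of target object ids
    relations: dict = field(default_factory=dict)


def hunt(graph, start_name, steps, query_attribute):
    """Follow relation steps from a named start object, then read an attribute.

    Returns (answer, path); path lists every visited object id in order.
    The answer is None if the start object or any relation step is missing.
    """
    # Locate the starting object by name.
    current = next(
        (oid for oid, obj in graph.objects.items() if obj["name"] == start_name),
        None,
    )
    if current is None:
        return None, []
    path = [current]

    # Follow each relation in turn, recording the visited objects.
    for relation in steps:
        targets = graph.relations.get((current, relation), [])
        if not targets:
            return None, path
        current = targets[0]  # simplification: take the first candidate
        path.append(current)

    # Read the answer directly from the final object's attributes.
    answer = graph.objects[current]["attributes"].get(query_attribute)
    return answer, path


# Toy graph for "What color is the umbrella the man is holding?"
graph = SceneGraph(
    objects={
        "o1": {"name": "man", "attributes": {}},
        "o2": {"name": "umbrella", "attributes": {"color": "blue"}},
    },
    relations={("o1", "holding"): ["o2"]},
)

answer, path = hunt(graph, "man", ["holding"], "color")
print(answer, path)
```

In this sketch the answer is looked up in the graph rather than chosen from a fixed answer vocabulary, which mirrors, in a very simplified way, the idea of querying the visual knowledge base directly; the returned `path` plays the role of the inference path.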

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition