Understanding the Role of Scene Graphs in Visual Question Answering

Vinay Damodaran; Sharanya Chakravarthy; Akshay Kumar; Anjana Umapathy; Teruko Mitamura; Yuta Nakashima; Noa Garcia; Chenhui Chu

arXiv:2101.05479 · cs.CV · January 19, 2021 · 22 cites

TL;DR

This paper investigates how scene graphs can improve Visual Question Answering (VQA). It evaluates scene graph generation techniques, training strategies, and fusion architectures on the GQA dataset, whose questions require counting, compositionality, and advanced reasoning.

Contribution

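The paper presents what its authors describe as the first multi-faceted study of scene graphs for VQA. It adapts image + question architectures to scene graph input, evaluates scene graph generation techniques for unseen images, proposes a training curriculum that combines human-annotated and auto-generated scene graphs, and builds late fusion architectures that learn from multiple image representations.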

Findings

01

Scene graphs improve VQA performance on complex questions.

02

Auto-generated scene graphs can be effectively used alongside human annotations.

03

Late fusion architectures enhance the integration of multiple image representations.

Abstract

Visual Question Answering (VQA) is of tremendous interest to the research community with important applications such as aiding visually impaired users and image-based search. In this work, we explore the use of scene graphs for solving the VQA task. We conduct experiments on the GQA dataset which presents a challenging set of questions requiring counting, compositionality and advanced reasoning capability, and provides scene graphs for a large number of images. We adopt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, propose a training curriculum to leverage human-annotated and auto-generated scene graphs, and build late fusion architectures to learn from multiple image representations. We present a multi-faceted study into the use of scene graphs for VQA, making this work the first of its kind.
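
The abstract mentions late fusion over multiple image representations but does not describe the mechanism. As a rough illustration of the general idea only, the sketch below averages the answer distributions of two separately trained models, one fed image features and one fed a scene graph. This is not the authors' architecture: the model names, the answer vocabulary, the scores, and the weighted-average rule are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(per_model_logits: Dict[str, Sequence[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average of each model's answer distribution.

    Assumed fusion rule for illustration; the paper may fuse differently.
    """
    names = list(per_model_logits)
    total_weight = sum(weights[name] for name in names)
    num_answers = len(per_model_logits[names[0]])
    fused = [0.0] * num_answers
    for name in names:
        probs = softmax(per_model_logits[name])
        for i, p in enumerate(probs):
            fused[i] += weights[name] * p / total_weight
    return fused


# Hypothetical answer vocabulary and per-model scores, not taken from the paper.
ANSWERS = ["yes", "no", "red"]
logits = {
    "image_features": [2.0, 1.0, 0.0],  # model that sees visual features
    "scene_graph": [0.0, 3.0, 0.0],     # model that sees the scene graph
}
fused = late_fusion(logits, weights={"image_features": 1.0, "scene_graph": 1.0})
prediction = ANSWERS[max(range(len(fused)), key=fused.__getitem__)]
```

Each model is trained and run on its own, and only the final answer distributions are combined. That separation is what distinguishes late fusion from approaches that merge the representations before prediction.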

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning