Self-Critical Reasoning for Robust Visual Question Answering

Jialin Wu; Raymond J. Mooney

arXiv:1905.09998 · cs.CV · January 1, 2020 · 91 cites

TL;DR

This paper proposes a self-critical training method for VQA systems that improves their ability to generalize by aligning visual explanations with influential image regions, leading to state-of-the-art results on the VQA-CP dataset.

Contribution

It introduces a novel self-critical training objective that enhances VQA model robustness by leveraging visual explanations, either human-annotated or automatically derived.

Findings

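1. Achieved 49.5% accuracy on VQA-CP when the influential regions are derived from textual explanations.

2. Achieved 48.5% accuracy on VQA-CP when the influential regions are annotated automatically from significant words in the question and answer.

3. Both results set a new state of the art on the VQA-CP generalization task.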

Abstract

Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors, and they fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that the visual explanations of correct answers match the most influential image regions more closely than those of other competitive answer candidates. The influential regions are determined either from human visual/textual explanations or automatically from just the significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state of the art: 49.5% using textual explanations and 48.5% using automatically annotated regions.
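
The objective described in the abstract can be illustrated with a small sketch. This is one possible reading of the idea, not the authors' implementation: it assumes each answer candidate already has a sensitivity score per image region (for example, a gradient-based score), and it applies a hinge penalty whenever a competing answer is more sensitive to the influential region than the correct answer is. The function name `self_critical_penalty`, its parameters, and the toy numbers are all hypothetical.

```python
def self_critical_penalty(sensitivity, correct, competitors, influential_region):
    """Hinge-style penalty (illustrative assumption, not the paper's exact loss).

    sensitivity[a][r] is an assumed score for how strongly answer a
    depends on image region r. The penalty is positive only when a
    competing answer relies on the influential region more than the
    correct answer does.
    """
    s_correct = sensitivity[correct][influential_region]
    return sum(
        max(0.0, sensitivity[a][influential_region] - s_correct)
        for a in competitors
    )


# Toy example: 3 answer candidates, 3 image regions (made-up scores).
sensitivity = [
    [0.1, 0.7, 0.2],  # answer 0: the correct answer
    [0.3, 0.9, 0.1],  # answer 1: competitor, more sensitive to region 1
    [0.2, 0.4, 0.5],  # answer 2: competitor, less sensitive to region 1
]

penalty = self_critical_penalty(
    sensitivity, correct=0, competitors=[1, 2], influential_region=1
)
print(penalty)  # approximately 0.2, contributed by answer 1 only
```

In this toy case only answer 1 is penalized, since it is the only competitor whose score on the influential region exceeds that of the correct answer. Minimizing such a term during training would push the model to ground the correct answer in the influential region at least as strongly as any competitor.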

Code & Models

Repositories

jialinwu17/Self_Critical_VQA (PyTorch)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques