Self-Critical Reasoning for Robust Visual Question Answering
Jialin Wu, Raymond J. Mooney

TL;DR
This paper proposes a self-critical training method for VQA systems that improves their ability to generalize by aligning visual explanations with influential image regions, leading to state-of-the-art results on the VQA-CP dataset.
Contribution
It introduces a novel self-critical training objective that enhances VQA model robustness by leveraging visual explanations, either human-annotated or automatically derived.
Findings
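Achieved 49.5% accuracy on VQA-CP when influential regions were derived from textual explanations.
Achieved 48.5% accuracy on VQA-CP when influential regions were annotated automatically from significant question and answer words.
Both results were reported as a new state of the art on the VQA-CP generalization task.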
Abstract
Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors, and they fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that the visual explanation of the correct answer matches the most influential image regions more closely than the explanations of other competitive answer candidates. The influential regions are determined either from human visual/textual explanations or automatically from just the significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state of the art: 49.5% using textual explanations and 48.5% using automatically annotated regions.
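The idea behind the self-critical objective can be illustrated with a small sketch. It assumes that per-answer, per-region sensitivity scores are already available (in practice they would come from some gradient-based visual explanation), and it uses a simple hinge penalty that is nonzero only when a competing answer is more sensitive to the influential region than the correct answer is. The function name, arguments, and toy numbers are hypothetical and chosen for illustration; this is not the authors' exact formulation.

```python
def self_critical_penalty(sensitivity, correct_idx, competitor_idxs, region_idx):
    """Hinge-style penalty (illustrative assumption, not the paper's exact loss).

    sensitivity[a][r] is an assumed score for how strongly answer `a`
    depends on image region `r`. The penalty grows whenever a competing
    answer depends on the influential region more than the correct answer.
    """
    correct = sensitivity[correct_idx][region_idx]
    return sum(
        max(0.0, sensitivity[a][region_idx] - correct)
        for a in competitor_idxs
    )


# Toy scores for 3 candidate answers over 4 image regions (made-up values).
sens = [
    [0.1, 0.9, 0.2, 0.0],  # answer 0: treated as the correct answer
    [0.3, 0.4, 0.1, 0.2],  # answer 1: competitor
    [0.0, 1.2, 0.5, 0.1],  # answer 2: competitor
]

# Region 1 is assumed to be the most influential region.
penalty = self_critical_penalty(sens, correct_idx=0, competitor_idxs=[1, 2], region_idx=1)
```

Here only answer 2 contributes to the penalty, because it is the only competitor whose score on the influential region exceeds that of the correct answer. Minimizing such a term during training would push the model to ground the correct answer in that region more strongly than its competitors.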
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
