Interpretable Visual Question Answering via Reasoning Supervision

Maria Parelli; Dimitrios Mallis; Markos Diomataris; Vassilis Pitsikalis

arXiv:2309.03726·cs.CV·September 8, 2023

Open Access

TL;DR

This paper introduces a novel VQA model that uses reasoning supervision from textual justifications to improve visual grounding and performance without needing explicit grounding annotations.

Contribution

The work proposes a new architecture that leverages reasoning supervision from textual justifications to enhance visual grounding in VQA models.

Findings

01. Improved visual perception and reasoning in VQA models.

02. Enhanced performance on VQA tasks without explicit grounding annotations.

03. Qualitative evidence of better visual attention alignment.

Abstract

Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer, with such annotations being already available on large-scale Visual Common Sense Reasoning (VCR) datasets. The model's visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the…
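To make the idea of a similarity loss between attention distributions concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract excerpt above is truncated, so which two attention distributions are aligned is an assumption here: the sketch simply takes two sets of raw attention scores over the same image regions (the hypothetical names `scores_a` and `scores_b`). The choice of a symmetric KL divergence as the similarity measure is also an assumption made for illustration.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_similarity_loss(scores_a, scores_b):
    """Symmetric KL between two attention distributions.

    Assumption: both score lists refer to the same image regions, in the
    same order. The loss is near zero when the two distributions agree and
    grows as they attend to different regions.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both attention vectors must cover the same regions")
    p = softmax(scores_a)
    q = softmax(scores_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


if __name__ == "__main__":
    # Hypothetical scores over four image regions.
    agree = attention_similarity_loss([2.0, 0.5, 0.1, 0.1], [2.0, 0.5, 0.1, 0.1])
    differ = attention_similarity_loss([2.0, 0.5, 0.1, 0.1], [0.1, 0.1, 0.5, 2.0])
    print(f"matching attention: {agree:.4f}")
    print(f"mismatched attention: {differ:.4f}")
```

In a training setup of this kind, such a term would typically be added to the answer-prediction loss with a weighting factor, so that minimizing it pulls one attention map toward the other without requiring region-level grounding labels.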

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning