Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

Yundong Zhang; Juan Carlos Niebles; Alvaro Soto

arXiv:1808.00265·cs.CV·August 2, 2018·1 cites

TL;DR

This paper trains interpretable VQA models with automatically mined visual grounding supervision. The resulting groundings correlate more strongly with manual annotations, and the model reaches state-of-the-art VQA accuracy without expensive human-annotated groundings.

Contribution

It introduces a novel approach to train VQA models with automatically obtained grounding supervision from region descriptions and object annotations.

Findings

01

Groundings generated by the model trained with mined supervision correlate more strongly with manually annotated groundings.

02

Achieved state-of-the-art VQA accuracy.

03

Reduced reliance on costly human-annotated groundings.
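
One way the correlation in finding 01 could be measured is sketched below. The summary says only "correlation" and does not name the statistic, so the use of Spearman rank correlation here is an assumption, as are the function names `average_ranks` and `rank_correlation`. Both attention maps are assumed to be flattened to equal-length lists of non-negative weights.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def rank_correlation(model_map: Sequence[float], human_map: Sequence[float]) -> float:
    """Spearman rank correlation (assumed metric) between two flattened attention maps."""
    if len(model_map) != len(human_map):
        raise ValueError("attention maps must have the same number of cells")
    ra, rb = average_ranks(model_map), average_ranks(human_map)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant map carries no ranking information
    return cov / math.sqrt(var_a * var_b)
```

A score near 1 would mean the model ranks image cells in nearly the same order as the human annotation; a score near -1 would mean the order is reversed.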

Abstract

A key aspect of interpretable VQA models is their ability to ground their answers in relevant regions of the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific to visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be obtained automatically from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve a higher correlation with manually annotated groundings, while also achieving state-of-the-art VQA accuracy.
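
The idea of mining grounding supervision from existing annotations can be illustrated with a minimal sketch. It is not the authors' procedure: the word-overlap matching rule, the stopword list, the 14x14 grid default, and the names `select_boxes` and `boxes_to_attention` are all assumptions made for illustration. Each annotation is assumed to be a `(text, box)` pair, where `text` is an object label or region description and `box` is `(x, y, width, height)` in pixels.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) in pixels
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "what", "which"}


def tokens(text: str) -> set:
    """Lowercase word set with a small, hypothetical stopword list removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def select_boxes(question: str, answer: str,
                 annotations: List[Tuple[str, Box]]) -> List[Box]:
    """Keep boxes whose text shares a word with the question or answer (assumed heuristic)."""
    qa_words = tokens(question) | tokens(answer)
    return [box for text, box in annotations if tokens(text) & qa_words]


def boxes_to_attention(boxes: List[Box], image_w: float, image_h: float,
                       grid: int = 14) -> List[List[float]]:
    """Rasterize boxes onto a grid x grid map that sums to 1."""
    cell_w, cell_h = image_w / grid, image_h / grid
    heat = [[0.0] * grid for _ in range(grid)]
    for x, y, w, h in boxes:
        for r in range(grid):
            for c in range(grid):
                # Fraction of this cell covered by the box.
                ox = max(0.0, min(x + w, (c + 1) * cell_w) - max(x, c * cell_w))
                oy = max(0.0, min(y + h, (r + 1) * cell_h) - max(y, r * cell_h))
                heat[r][c] += (ox * oy) / (cell_w * cell_h)
    total = sum(sum(row) for row in heat)
    if total == 0:
        # No matching box: fall back to a uniform map.
        return [[1.0 / (grid * grid)] * grid for _ in range(grid)]
    return [[v / total for v in row] for row in heat]
```

A map produced this way could serve as a target distribution for the attention weights of a VQA model during training, in place of a human-drawn grounding. The choice of loss for that comparison is left open here.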

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning