VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

Chongyan Chen; Samreen Anjum; Danna Gurari

arXiv:2308.11662 · cs.CV · August 29, 2023


TL;DR

This paper introduces VQAAnswerTherapy, a dataset that visually grounds each unique answer to a visual question. It poses two problems, predicting whether a visual question has a single answer grounding and localizing all answer groundings, and benchmarks modern algorithms to show where they succeed and struggle.

Contribution

It presents the first dataset with answer groundings for visual questions and proposes new problems for predicting and localizing answer groundings.

Findings

01. Modern algorithms vary in success across tasks

02. The dataset reveals specific challenges in grounding answers

03. Benchmark results highlight areas for improvement

Abstract

Visual question answering is the task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why by visually grounding the answers. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQAAnswerTherapy. We then propose two novel problems: predicting whether a visual question has a single answer grounding, and localizing all answer groundings. We benchmark modern algorithms on these problems to show where they succeed and struggle. The dataset and evaluation server are publicly available at https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/.
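
To make the "single answer grounding" question concrete, here is a minimal sketch of one way to decide it from ground-truth regions. It is not the authors' procedure: it assumes each answer's grounding is a set of (row, col) pixels, that two groundings count as the same when their intersection over union (IoU) reaches a threshold, and that 0.9 is a reasonable threshold. The function names, the `rect` helper, and the example answers are all hypothetical.

```python
from itertools import combinations


def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # assumption: two empty masks are treated as identical
    return len(mask_a & mask_b) / len(union)


def has_single_grounding(groundings, threshold=0.9):
    """Return True if every pair of answer groundings has IoU >= threshold.

    `groundings` maps each unique answer string to its pixel-set mask.
    The 0.9 default is an illustrative assumption, not a value from the paper.
    """
    masks = list(groundings.values())
    return all(iou(a, b) >= threshold for a, b in combinations(masks, 2))


def rect(top, left, bottom, right):
    """Hypothetical helper: a filled rectangle as a pixel set (bottom/right exclusive)."""
    return {(r, c) for r in range(top, bottom) for c in range(left, right)}


# Two answers that point at the same region: one shared grounding.
same_region = {"soda": rect(0, 0, 10, 10), "coke": rect(0, 0, 10, 10)}

# Two answers that point at different regions: multiple groundings.
different_regions = {"table": rect(0, 0, 10, 10), "cup": rect(2, 2, 5, 5)}

print(has_single_grounding(same_region))
print(has_single_grounding(different_regions))
```

In this framing, answers that differ only in wording share a region and yield a single grounding, while answers that refer to different parts of the image yield several. A model for the first problem would predict this label from the image and question alone; a model for the second would predict the regions themselves.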

Code & Models

Repositories

ccychongyanchen/vqatherapycrowdsourcing (Official)

Videos

VQA Therapy: Exploring Answer Differences by Visually Grounding Answers (YouTube)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques