Robust Visual Question Answering: Datasets, Methods, and Future Challenges

Jie Ma; Pinghui Wang; Dechen Kong; Zewei Wang; Jun Liu; Hongbin Pei; Junzhou Zhao

arXiv:2307.11471·cs.CV·February 20, 2024·2 cites

Open Access

TL;DR

This paper surveys the development of datasets, evaluation metrics, and debiasing methods for improving the robustness of visual question answering systems, highlighting challenges and future directions.

Contribution

The paper provides the first comprehensive review of datasets, evaluation metrics, and debiasing techniques for VQA robustness, and it also analyzes the robustness of vision-and-language pre-training models.

Findings

1. Debiasing methods improve out-of-distribution performance

2. Vision-and-language pre-training models show varied robustness

3. Future research should focus on grounding and bias mitigation

Abstract

Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question. However, it is widely recognized that previous generic VQA methods often exhibit a tendency to memorize biases present in the training data rather than learning proper behaviors, such as grounding images before predicting answers. Therefore, these methods usually achieve high in-distribution but poor out-of-distribution performance. In recent years, various datasets and debiasing methods have been proposed to evaluate and enhance the VQA robustness, respectively. This paper provides the first comprehensive survey focused on this emerging fashion. Specifically, we first provide an overview of the development process of datasets from in-distribution and out-of-distribution perspectives. Then, we examine the evaluation metrics employed by these…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques