Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Jie Ma, Pinghui Wang, Dechen Kong, Zewei Wang, Jun Liu, Hongbin Pei, Junzhou Zhao

TL;DR
This paper surveys the development of datasets, evaluation metrics, and debiasing methods for improving the robustness of visual question answering systems, highlighting challenges and future directions.
Contribution
It provides the first comprehensive review of datasets, evaluation metrics, and debiasing techniques for VQA robustness, including an analysis of vision-and-language pre-training models.
Findings
Debiasing methods improve out-of-distribution performance
Vision-and-language pre-training models show varied robustness
Future research should focus on grounding and bias mitigation
Abstract
Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question. However, it is widely recognized that previous generic VQA methods tend to memorize biases present in the training data rather than learn proper behaviors, such as grounding images before predicting answers. Therefore, these methods usually achieve high in-distribution but poor out-of-distribution performance. In recent years, various datasets and debiasing methods have been proposed to evaluate and enhance VQA robustness, respectively. This paper provides the first comprehensive survey focused on this emerging area. Specifically, we first provide an overview of the development process of datasets from in-distribution and out-of-distribution perspectives. Then, we examine the evaluation metrics employed by these…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
