Human-Adversarial Visual Question Answering
Sasha Sheng, Amanpreet Singh, Vedanuj Goswami, Jose Alberto Lopez Magana, Wojciech Galuba, Devi Parikh, Douwe Kiela

TL;DR
This paper introduces AdVQA, a benchmark using human-adversarial examples to evaluate and challenge state-of-the-art VQA models, revealing their vulnerabilities and guiding future improvements.
Contribution
It presents a novel adversarial benchmark for VQA, created through human interaction to find questions that fool current models, highlighting their weaknesses.
Findings
A wide range of state-of-the-art VQA models perform poorly when evaluated on the human-adversarial examples.
Adversarial examples reveal specific weaknesses in current VQA models.
The benchmark provides insights for future research directions.
Abstract
Performance on the most commonly used Visual Question Answering dataset (VQA v2) is starting to approach human accuracy. However, in interacting with state-of-the-art VQA models, it is clear that the problem is far from being solved. In order to stress test VQA models, we benchmark them against human-adversarial examples. Human subjects interact with a state-of-the-art VQA model, and for each image in the dataset, attempt to find a question where the model's predicted answer is incorrect. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples. We conduct an extensive analysis of the collected adversarial examples and provide guidance on future research directions. We hope that this Adversarial VQA (AdVQA) benchmark can help drive progress in the field and advance the state of the art.
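The collection procedure described in the abstract (a person tries questions on an image until the model's predicted answer is wrong) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' annotation pipeline: the names `collect_adversarial_example`, `model_predict`, and `is_correct` are hypothetical, the human annotator is stood in for by a list of candidate questions, and the correctness judgment is assumed to come from a supplied callback.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class AdversarialExample(NamedTuple):
    """One collected example: a question the model answered incorrectly."""
    image_id: str
    question: str
    model_answer: str


def collect_adversarial_example(
    image_id: str,
    candidate_questions: Iterable[str],
    model_predict: Callable[[str, str], str],
    is_correct: Callable[[str, str, str], bool],
) -> Optional[AdversarialExample]:
    """Try questions on one image until the model's answer is judged wrong.

    All arguments are hypothetical stand-ins:
    - candidate_questions plays the role of the human annotator's attempts,
    - model_predict(image_id, question) plays the role of the VQA model,
    - is_correct(image_id, question, answer) plays the role of human judgment.
    Returns None if no attempted question fools the model.
    """
    for question in candidate_questions:
        answer = model_predict(image_id, question)
        if not is_correct(image_id, question, answer):
            return AdversarialExample(image_id, question, answer)
    return None


if __name__ == "__main__":
    # Toy model that always answers "2", and a toy judge that accepts
    # that answer only for the counting question.
    toy_model = lambda image_id, question: "2"
    toy_judge = lambda image_id, question, answer: question == "How many dogs?"

    example = collect_adversarial_example(
        "img1",
        ["How many dogs?", "What color is the sign?"],
        toy_model,
        toy_judge,
    )
    print(example)
```

In this toy run the first question is answered acceptably, so the loop moves on and records the second question as the adversarial example.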
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
