Visual Robustness Benchmark for Visual Question Answering (VQA)
Md Farhan Ishmam, Ishmam Tashdeed, Talukder Asir Saadat, Md Hamjajul Ashmafee, Abu Raihan Mostofa Kamal, Md. Azam Hossain

TL;DR
This paper introduces a large-scale benchmark to evaluate the visual robustness of VQA models against realistic image corruptions, revealing insights into the trade-offs between model size, accuracy, and robustness.
Contribution
It presents the first comprehensive benchmark with augmented images and new evaluation metrics for assessing visual robustness in VQA models.
Findings
Larger models tend to be more robust to visual corruptions.
There is a trade-off between model performance and robustness.
Current models vary significantly in their resilience to image corruptions.
Abstract
Can Visual Question Answering (VQA) systems perform just as well when deployed in the real world? Or are they susceptible to realistic corruption effects, e.g., image blur, which can be detrimental in sensitive applications such as medical VQA? While linguistic or textual robustness has been thoroughly explored in the VQA literature, there has yet to be any significant work on the visual robustness of VQA models. We propose the first large-scale benchmark comprising 213,000 augmented images, challenging the visual robustness of multiple VQA models and assessing the strength of realistic visual corruptions. Additionally, we have designed several robustness evaluation metrics that can be aggregated into a unified metric and tailored to fit a variety of use cases. Our experiments reveal several insights into the relationships between model size, performance, and robustness with the visual…
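To make the idea of measuring visual robustness concrete, here is a minimal sketch of how accuracy under a blur corruption might be compared with accuracy on clean images. It is an illustration under stated assumptions, not the paper's benchmark: the box blur, the `relative_accuracy_drop` measure, the `evaluate_robustness` helper, and the toy model are all hypothetical stand-ins, and the paper's own corruption functions and metrics are not reproduced here.

```python
from statistics import mean


def box_blur(image, radius):
    """Blur a 2D grayscale image (list of rows) with a (2r+1)x(2r+1) mean filter.

    Hypothetical stand-in for a realistic blur corruption; radius acts as severity.
    """
    h, w = len(image), len(image[0])
    blurred = []
    for y in range(h):
        row = []
        for x in range(w):
            # Average over the window, clipped at the image borders.
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            row.append(mean(window))
        blurred.append(row)
    return blurred


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    return mean(1.0 if p == a else 0.0 for p, a in zip(predictions, answers))


def relative_accuracy_drop(clean_acc, corrupted_acc):
    """Share of clean accuracy lost under corruption (assumed illustrative metric)."""
    if clean_acc == 0:
        return 0.0
    return (clean_acc - corrupted_acc) / clean_acc


def evaluate_robustness(model, images, questions, answers, severities=(1, 2, 3)):
    """Return clean accuracy and the relative accuracy drop per blur severity."""
    clean_preds = [model(img, q) for img, q in zip(images, questions)]
    clean_acc = accuracy(clean_preds, answers)
    drops = {}
    for severity in severities:
        preds = [model(box_blur(img, severity), q) for img, q in zip(images, questions)]
        drops[severity] = relative_accuracy_drop(clean_acc, accuracy(preds, answers))
    return clean_acc, drops


def toy_model(image, question):
    """Toy 'VQA model' (assumption): answers 'yes' if any pixel is bright."""
    return "yes" if max(max(row) for row in image) > 0.5 else "no"


# One image with a single bright pixel (fragile under blur), one uniformly bright.
spot = [[1.0 if (x, y) == (2, 2) else 0.0 for x in range(5)] for y in range(5)]
flat = [[1.0] * 5 for _ in range(5)]
clean_acc, drops = evaluate_robustness(
    toy_model, [spot, flat], ["Is it bright?"] * 2, ["yes", "yes"], severities=(1, 2)
)
print(clean_acc, drops)
```

Per-severity drops like these could then be averaged or weighted into a single score; how the paper actually aggregates its metrics is not specified in this excerpt.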
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: BLIP: Bootstrapping Language-Image Pre-training · Vision-and-Language Transformer
