Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking
Dirk Väth, Pascal Tilli, Ngoc Thang Vu

TL;DR
This paper introduces a comprehensive, browser-based benchmarking tool for Visual Question Answering that evaluates models on accuracy, robustness, biases, and uncertainty across multiple datasets, highlighting generalization challenges.
Contribution
The paper presents a new benchmarking tool for VQA that assesses models beyond accuracy, including robustness and bias, with easy integration and interactive analysis features.
Findings
State-of-the-art models lack generalization across datasets.
Metrics identify embeddings that improve robustness.
Models fail to recognize text in images despite dataset training.
Abstract
On the way towards general Visual Question Answering (VQA) systems that are able to answer arbitrary questions, the need arises for evaluation beyond single-metric leaderboards for specific datasets. To this end, we propose a browser-based benchmarking tool for researchers and challenge organizers, with an API for easy integration of new models and datasets to keep up with the fast-changing landscape of VQA. Our tool helps test generalization capabilities of models across multiple datasets, evaluating not just accuracy, but also performance in more realistic real-world scenarios such as robustness to input noise. Additionally, we include metrics that measure biases and uncertainty, to further explain model behavior. Interactive filtering facilitates discovery of problematic behavior, down to the data sample level. As proof of concept, we perform a case study on four models. We find that…
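The evaluation idea described in the abstract (per-dataset accuracy plus robustness to input noise) can be illustrated with a minimal sketch. This is not the tool's API: the names `Sample`, `Model`, `accuracy`, `noise_robustness`, and `benchmark` are hypothetical, images are assumed to be flat feature vectors, and robustness is assumed here to mean the share of predictions that stay unchanged under Gaussian noise. The paper's actual metric definitions may differ.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only (not the tool's API):
# a sample is (image_features, question, gold_answer);
# a model maps (image_features, question) to a predicted answer.
Sample = Tuple[List[float], str, str]
Model = Callable[[List[float], str], str]


def accuracy(model: Model, samples: List[Sample]) -> float:
    """Fraction of samples whose prediction exactly matches the gold answer."""
    if not samples:
        return 0.0
    correct = sum(model(img, q) == ans for img, q, ans in samples)
    return correct / len(samples)


def add_gaussian_noise(img: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Return a copy of the image features with zero-mean Gaussian noise added."""
    return [x + rng.gauss(0.0, sigma) for x in img]


def noise_robustness(model: Model, samples: List[Sample], sigma: float,
                     trials: int = 5, seed: int = 0) -> float:
    """Share of noisy predictions that agree with the clean prediction.

    Assumed definition: 1.0 means noise never changes the model's answer.
    """
    if not samples:
        return 0.0
    rng = random.Random(seed)
    stable = 0
    for img, q, _ in samples:
        clean = model(img, q)
        for _ in range(trials):
            if model(add_gaussian_noise(img, sigma, rng), q) == clean:
                stable += 1
    return stable / (len(samples) * trials)


def benchmark(model: Model, datasets: Dict[str, List[Sample]],
              sigma: float = 0.1) -> Dict[str, Dict[str, float]]:
    """Report accuracy and noise robustness separately for each dataset."""
    return {
        name: {
            "accuracy": accuracy(model, samples),
            "robustness": noise_robustness(model, samples, sigma),
        }
        for name, samples in datasets.items()
    }
```

Reporting the two numbers per dataset, rather than one aggregate score, is what makes a generalization gap visible: a model can score well on one dataset and poorly on another, or be accurate but unstable under noise.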
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Test
