Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking

Dirk Väth; Pascal Tilli; Ngoc Thang Vu

arXiv:2110.05159·cs.CV·October 12, 2021

TL;DR

This paper introduces a comprehensive, browser-based benchmarking tool for Visual Question Answering that evaluates models on accuracy, robustness, biases, and uncertainty across multiple datasets, highlighting generalization challenges.

Contribution

The paper presents a new benchmarking tool for VQA that assesses models beyond accuracy, including robustness and bias, with easy integration and interactive analysis features.

Findings

1. State-of-the-art models lack generalization across datasets.
2. Metrics identify embeddings that improve robustness.
3. Models fail to recognize text in images despite dataset training.

Abstract

On the way towards general Visual Question Answering (VQA) systems that are able to answer arbitrary questions, the need arises for evaluation beyond single-metric leaderboards for specific datasets. To this end, we propose a browser-based benchmarking tool for researchers and challenge organizers, with an API for easy integration of new models and datasets to keep up with the fast-changing landscape of VQA. Our tool helps test generalization capabilities of models across multiple datasets, evaluating not just accuracy, but also performance in more realistic real-world scenarios such as robustness to input noise. Additionally, we include metrics that measure biases and uncertainty, to further explain model behavior. Interactive filtering facilitates discovery of problematic behavior, down to the data sample level. As proof of concept, we perform a case study on four models. We find that…
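
To make "robustness to input noise" concrete, the following is a minimal sketch of one way such a score could be computed: perturb the input features with Gaussian noise several times and report the fraction of trials in which the predicted answer stays the same. This is an illustration only and is not the tool's API or necessarily the metric used in the paper; `noise_robustness`, `toy_predict`, `sigma`, and `trials` are hypothetical names chosen for this example.

```python
import random
from typing import Callable, List, Sequence


def noise_robustness(
    predict: Callable[[Sequence[float]], str],
    features: Sequence[float],
    sigma: float = 0.1,
    trials: int = 20,
    seed: int = 0,
) -> float:
    """Fraction of noisy trials whose answer matches the clean answer.

    Hypothetical helper: a value of 1.0 means the prediction never
    changed under the sampled noise, 0.0 means it always changed.
    """
    rng = random.Random(seed)  # fixed seed so the score is reproducible
    clean_answer = predict(features)
    unchanged = 0
    for _ in range(trials):
        # Add independent zero-mean Gaussian noise to every feature.
        noisy: List[float] = [x + rng.gauss(0.0, sigma) for x in features]
        if predict(noisy) == clean_answer:
            unchanged += 1
    return unchanged / trials


def toy_predict(features: Sequence[float]) -> str:
    """Stand-in for a VQA model (assumption): 'yes' if the mean feature is positive."""
    return "yes" if sum(features) / len(features) > 0 else "no"


if __name__ == "__main__":
    # Input far from the decision boundary: small noise should not flip the answer.
    print(noise_robustness(toy_predict, [5.0, 5.0, 5.0], sigma=0.1))
    # Input on the decision boundary: the answer may flip in some trials.
    print(noise_robustness(toy_predict, [0.0, 0.0], sigma=1.0, trials=50))
```

A real evaluation would replace `toy_predict` with a model's forward pass over image and question features and aggregate the score over a dataset; the per-sample form shown here mirrors the abstract's point about inspecting behavior down to the data sample level.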

Code & Models

Repositories

patilli/vqa_benchmarking (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Test