DARE: Diverse Visual Question Answering with Robustness Evaluation
Hannah Sterz, Jonas Pfeiffer, Ivan Vulić

TL;DR
DARE introduces a comprehensive benchmark for evaluating the robustness of vision-language models across diverse visual question answering scenarios, revealing significant performance gaps and brittleness in current models.
Contribution
The paper presents DARE, a new diverse VQA benchmark with robustness evaluations, highlighting the limitations of current VLMs in handling variations and complex reasoning tasks.
Findings
State-of-the-art VLMs struggle with most question categories.
Performance drops by up to 34% under robustness variations.
Open-source models are less robust than closed-source models.
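To make the "drop under robustness variations" finding concrete, here is a minimal sketch of how a worst-case drop could be computed from per-variation accuracies. The function name, the variation names, and all numbers are hypothetical illustrations, not values or code from the paper, and the summary does not say whether the reported 34% is an absolute or a relative drop, so the sketch returns both.

```python
from typing import Dict, NamedTuple


class RobustnessDrop(NamedTuple):
    worst_variation: str   # name of the variation with the lowest accuracy
    worst_accuracy: float  # accuracy under that variation
    absolute_drop: float   # standard accuracy minus worst-case accuracy
    relative_drop: float   # absolute drop divided by standard accuracy


def worst_case_drop(standard_acc: float,
                    variation_accs: Dict[str, float]) -> RobustnessDrop:
    """Compare standard-setting accuracy with the worst accuracy over variations.

    Assumption: accuracies are fractions in [0, 1]; the variation names are
    arbitrary labels chosen by the caller.
    """
    if not variation_accs:
        raise ValueError("need at least one variation accuracy")
    worst_name = min(variation_accs, key=variation_accs.get)
    worst_acc = variation_accs[worst_name]
    absolute = standard_acc - worst_acc
    relative = absolute / standard_acc if standard_acc > 0 else 0.0
    return RobustnessDrop(worst_name, worst_acc, absolute, relative)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    result = worst_case_drop(0.80, {"variation_a": 0.70, "variation_b": 0.60})
    print(result.worst_variation, round(result.absolute_drop, 3),
          round(result.relative_drop, 3))
```

Reporting the worst case rather than the mean over variations is one design choice among several; it emphasizes brittleness, since a single badly handled variation dominates the score.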
Abstract
Vision Language Models (VLMs) extend remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they might be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). In order to couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding
