DARE: Diverse Visual Question Answering with Robustness Evaluation

Hannah Sterz; Jonas Pfeiffer; Ivan Vulić

arXiv:2409.18023 · cs.CL · July 22, 2025

TL;DR

DARE introduces a comprehensive benchmark for evaluating the robustness of vision-language models across diverse visual question answering scenarios, revealing significant performance gaps and brittleness in current models.

Contribution

The paper presents DARE, a new diverse VQA benchmark with robustness evaluations, highlighting the limitations of current VLMs in handling variations and complex reasoning tasks.

Findings

01. State-of-the-art VLMs struggle with most question categories.

02. Performance drops by up to 34% under robustness variations.

03. Open-source models are less robust than closed-source models.

Abstract

Vision Language Models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they might be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). In order to couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding