NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Baiqi Li; Zhiqiu Lin; Wenxuan Peng; Jean de Dieu Nyandwi; Daniel Jiang; Zixian Ma; Simran Khanuja; Ranjay Krishna; Graham Neubig; Deva Ramanan

arXiv:2410.14669 · cs.CV · June 11, 2025 · 3 cites

TL;DR

NaturalBench is a new challenging benchmark for vision-language models that tests their ability to answer questions based on natural images, revealing significant gaps compared to human performance.

Contribution

The paper introduces NaturalBench, a semi-automated, vision-centric benchmark of 10,000 human-verified VQA samples, designed to evaluate VLMs more reliably and to be more challenging than previous benchmarks.

Findings

1. Most state-of-the-art VLMs lag 50-70% behind human performance.

2. NaturalBench exposes severe biases in VLMs, with models often ignoring image content.

3. Solving NaturalBench requires diverse visio-linguistic skills and advanced reasoning.

Abstract

Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, preventing blind solutions from answering without using the images. This makes NaturalBench more challenging than…
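
The paired design described in the abstract can be illustrated with a small sketch. It assumes a hypothetical `Pair` record (one question, two images, two different ground-truth answers) and a hypothetical `predict(question, image)` callable. The pair-level score used here, where a pair counts only if both images are answered correctly, is an illustrative choice and not necessarily the metric the paper reports. The toy data is invented for the example.

```python
from typing import Callable, List, NamedTuple, Tuple


class Pair(NamedTuple):
    """Hypothetical record: one question paired with two images whose answers differ."""
    question: str
    image_a: str
    answer_a: str
    image_b: str
    answer_b: str


def score(pairs: List[Pair], predict: Callable[[str, str], str]) -> Tuple[float, float]:
    """Return (per-sample accuracy, pair-level accuracy).

    A pair counts as solved only when the prediction is correct for BOTH images.
    """
    sample_hits = 0
    pair_hits = 0
    for p in pairs:
        ok_a = predict(p.question, p.image_a) == p.answer_a
        ok_b = predict(p.question, p.image_b) == p.answer_b
        sample_hits += int(ok_a) + int(ok_b)
        pair_hits += int(ok_a and ok_b)
    n = len(pairs)
    return sample_hits / (2 * n), pair_hits / n


# Toy data, invented for illustration only.
pairs = [
    Pair("Is the dog sitting?", "img_001", "yes", "img_002", "no"),
    Pair("Is the cup on the table?", "img_003", "no", "img_004", "yes"),
]


def blind_yes(question: str, image: str) -> str:
    """A 'blind' solver: ignores the image and always answers 'yes'."""
    return "yes"


sample_acc, pair_acc = score(pairs, blind_yes)
print(sample_acc, pair_acc)
```

Because the two images in a pair have different answers, a solver that ignores the image can be right on at most one of them. In this toy run the blind solver reaches 50% per-sample accuracy but solves no pair, which is the sense in which the pairing prevents blind solutions.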

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training