NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan

TL;DR
NaturalBench is a new challenging benchmark for vision-language models that tests their ability to answer questions based on natural images, revealing significant gaps compared to human performance.
Contribution
The paper introduces NaturalBench, a semi-automated, vision-centric benchmark of 10,000 human-verified VQA samples, designed to evaluate VLMs more reliably and to be more challenging than previous benchmarks.
Findings
Most state-of-the-art VLMs lag 50-70% behind human performance.
NaturalBench exposes severe biases in VLMs, with models often ignoring image content.
Solving NaturalBench requires diverse visio-linguistic skills and advanced reasoning.
Abstract
Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, preventing blind solutions from answering without using the images. This makes NaturalBench more challenging than…
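The paired design described in the abstract can be illustrated with a small scoring sketch. It is not the paper's evaluation code: the function names, the record layout, and the rule that a question counts as solved only when it is answered correctly for both of its images are assumptions made here for illustration. The sketch shows why a "blind" model that ignores the image can look passable under per-sample scoring yet gets no credit under paired scoring.

```python
from collections import defaultdict


def per_sample_accuracy(records):
    """Fraction of individual (question, image) pairs answered correctly."""
    records = list(records)
    if not records:
        return 0.0
    correct = sum(
        1 for _q, _img, predicted, expected in records
        if predicted.strip().lower() == expected.strip().lower()
    )
    return correct / len(records)


def paired_question_accuracy(records):
    """Fraction of questions answered correctly for every paired image.

    `records` is an iterable of (question_id, image_id, predicted, expected)
    tuples. This layout is hypothetical, not taken from the benchmark.
    """
    results_by_question = defaultdict(list)
    for question_id, _image_id, predicted, expected in records:
        is_correct = predicted.strip().lower() == expected.strip().lower()
        results_by_question[question_id].append(is_correct)
    if not results_by_question:
        return 0.0
    solved = sum(1 for results in results_by_question.values() if all(results))
    return solved / len(results_by_question)


# A blind model that answers "yes" no matter which image it is shown.
# Each question has two images whose correct answers differ.
blind_predictions = [
    ("q1", "img_a", "yes", "yes"),
    ("q1", "img_b", "yes", "no"),
    ("q2", "img_a", "yes", "no"),
    ("q2", "img_b", "yes", "yes"),
]

blind_per_sample = per_sample_accuracy(blind_predictions)
blind_paired = paired_question_accuracy(blind_predictions)
```

Because the two images for each question have different answers, any fixed answer is wrong for at least one of them, so `blind_paired` is zero here even though `blind_per_sample` is one half.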
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
