COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

Ishant Chintapatla; Kazuma Choji; Naaisha Agarwal; Andrew Lin; Hannah You; Charles Duong; Kevin Zhu; Sean O'Brien; Vasu Sharma

arXiv:2507.13405 · cs.CV · July 21, 2025

TL;DR

COREVQA introduces a new benchmark for evaluating vision-language models' ability to perform visual entailment reasoning in crowded scenes, revealing significant gaps in current models' capabilities.

Contribution

The paper presents COREVQA, a benchmark of 5608 pairs of crowded images (drawn from the CrowdHuman dataset) and synthetically generated true/false statements, designed to test visual entailment reasoning in VLMs.

Findings

01 Even the top-performing VLMs achieve accuracy below 80% on COREVQA.

02 Other models perform substantially worse, with accuracy between 39.98% and 69.95%.

03 The benchmark exposes limitations in current VLMs' ability to perform visual entailment over crowded scenes.

Abstract

Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model's ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 image and synthetically generated true/false statement pairs, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs' ability to reason over certain…
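The accuracy figures above come from scoring a model's true/false answers against ground-truth labels. The following is a minimal sketch of that kind of scoring, not the authors' evaluation code: the `Example` record, its field names, the `parse_true_false` helper, and the rule that unparseable replies count as wrong are all assumptions made for illustration, and the sample statements are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    # Hypothetical record layout; the real dataset's column names may differ.
    statement: str    # true/false statement about a crowded image
    label: bool       # ground-truth answer
    prediction: str   # raw text returned by the model


def parse_true_false(text: str) -> Optional[bool]:
    """Map a raw model reply to True/False, or None if it is neither."""
    cleaned = text.strip().lower()
    if cleaned.startswith("true"):
        return True
    if cleaned.startswith("false"):
        return False
    return None


def accuracy(examples: List[Example]) -> float:
    """Fraction of examples whose parsed prediction matches the label.

    Assumption: an unparseable reply is counted as incorrect.
    """
    if not examples:
        return 0.0
    correct = sum(
        1 for ex in examples if parse_true_false(ex.prediction) == ex.label
    )
    return correct / len(examples)


# Invented statements, used only to exercise the functions.
demo = [
    Example("Three people in the foreground wear hats.", True, "True"),
    Example("Nobody in the image is holding a phone.", False, "true."),
    Example("A person on the left carries a red bag.", False, " False"),
    Example("Every visible person is facing the camera.", True, "Unclear"),
]
demo_accuracy = accuracy(demo)
```

Counting unparseable replies as errors keeps the denominator fixed at the full benchmark size, so scores stay comparable across models that refuse or hedge at different rates.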

Code & Models

Datasets

COREVQA2025/COREVQA
dataset · 114 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Mobile Crowdsensing and Crowdsourcing · Topic Modeling