Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
Nitzan Bitton-Guetta, Aviv Slobodkin, Aviya Maimon, Eliya Habba, Royi Rassin, Yonatan Bitton, Idan Szpektor, Amir Globerson, Yuval Elovici

TL;DR
Visual Riddles is a new benchmark designed to evaluate vision and language models on complex visual scenarios requiring commonsense and world knowledge, revealing significant gaps between current models and human performance.
Contribution
This paper introduces Visual Riddles, a benchmark of 400 visual riddles whose images were generated with a variety of text-to-image models, to challenge and assess the reasoning abilities of vision-language models.
Findings
Current models significantly underperform humans on the benchmark.
Gemini-Pro-1.5 achieves 40% accuracy, far below the 82% accuracy achieved by humans.
The benchmark enables scalable automatic evaluation.
Abstract
Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, thereby alleviating the need for further information. This example illustrates how subtle visual cues can challenge our cognitive skills and demonstrates the complexity of interpreting visual scenarios. To study these skills, we present Visual Riddles, a benchmark aimed to test vision and language models on visual riddles requiring commonsense and world knowledge. The benchmark comprises 400 visual riddles, each featuring a unique image created by a variety of text-to-image models, question, ground-truth answer, textual hint, and attribution. Human evaluation reveals that existing models lag significantly behind human performance, which is at 82% accuracy, with…
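As a rough illustration of the record layout the abstract lists (image, question, ground-truth answer, textual hint, attribution) and of how an accuracy figure is computed from per-answer judgments, here is a minimal Python sketch. The class name, field names, file path, and example riddle text are all hypothetical, loosely modeled on the mosquito example above; this is not the benchmark's actual schema or evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualRiddle:
    # Hypothetical field names mirroring the components named in the abstract.
    image_path: str
    question: str
    ground_truth_answer: str
    hint: str
    attribution: str


def accuracy(judgments: list[bool]) -> float:
    """Fraction of answers judged correct (by a human rater or an automatic judge)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


# Illustrative record only; the text below is invented for this sketch.
riddle = VisualRiddle(
    image_path="riddles/example.png",  # placeholder path
    question="Why is the person scratching their arm?",
    ground_truth_answer="A mosquito is nearby and has likely bitten them.",
    hint="Look at what is flying near the person.",
    attribution="(source supporting the answer, if any)",
)

# Toy judgments for four answers: three judged correct, one incorrect.
toy_judgments = [True, True, False, True]
toy_accuracy = accuracy(toy_judgments)
```

The reported 82% and 40% figures would correspond to this kind of ratio computed over judged answers to the benchmark's riddles.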
Taxonomy
Topics: Multimodal Machine Learning Applications
