Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

Nitzan Bitton-Guetta; Aviv Slobodkin; Aviya Maimon; Eliya Habba; Royi Rassin; Yonatan Bitton; Idan Szpektor; Amir Globerson; Yuval Elovici

arXiv:2407.19474 · cs.CV · November 26, 2024 · 1 citation

TL;DR

Visual Riddles is a new benchmark designed to evaluate vision and language models on complex visual scenarios requiring commonsense and world knowledge, revealing significant gaps between current models and human performance.

Contribution

This paper introduces Visual Riddles, a benchmark of 400 visual riddles whose images were generated with a variety of text-to-image models, designed to challenge and assess the reasoning abilities of vision-language models.

Findings

1. Current models significantly underperform humans on the benchmark.
2. Gemini-Pro-1.5 achieves 40% accuracy, far below the human accuracy of 82%.
3. The benchmark enables scalable automatic evaluation.

Abstract

Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, thereby alleviating the need for further information. This example illustrates how subtle visual cues can challenge our cognitive skills and demonstrates the complexity of interpreting visual scenarios. To study these skills, we present Visual Riddles, a benchmark aimed to test vision and language models on visual riddles requiring commonsense and world knowledge. The benchmark comprises 400 visual riddles, each featuring a unique image created by a variety of text-to-image models, question, ground-truth answer, textual hint, and attribution. Human evaluation reveals that existing models lag significantly behind human performance, which is at 82% accuracy, with…
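As a rough illustration of the record structure the abstract describes (image, question, ground-truth answer, textual hint, attribution), the following is a minimal sketch, not the benchmark's actual schema. The class name `VisualRiddle`, its field names, the `accuracy` helper, and the example values are all assumptions made for illustration; the example riddle simply restates the mosquito scenario from the abstract.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VisualRiddle:
    """Hypothetical record for one riddle; field names are assumed, not official."""
    image_path: str           # path to the generated image
    question: str             # the riddle posed about the image
    ground_truth_answer: str  # reference answer
    hint: str                 # textual hint pointing at the visual cue
    attribution: str          # source supporting the answer


def accuracy(judgments: Sequence[bool]) -> float:
    """Fraction of answers judged correct (by a human or an automatic rater)."""
    if not judgments:
        raise ValueError("accuracy is undefined for an empty list of judgments")
    return sum(judgments) / len(judgments)


# Illustrative record based on the scenario in the abstract (values are made up).
example = VisualRiddle(
    image_path="riddles/mosquito.png",
    question="Why is the person scratching their arm?",
    ground_truth_answer="A mosquito nearby has likely bitten them.",
    hint="Look at the small insect near the person.",
    attribution="placeholder attribution",
)

# Four hypothetical per-riddle correctness judgments for one model.
model_accuracy = accuracy([True, False, True, True])
```

Under this sketch, a model's score on the benchmark would be `accuracy` applied to one correctness judgment per riddle, which is the quantity the reported 40% and 82% figures compare.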

Videos

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications