VASR: Visual Analogies of Situation Recognition
Yonatan Bitton, Ron Yosef, Eli Strugo, Dafna Shahaf, Roy Schwartz, Gabriel Stanovsky

TL;DR
This paper introduces a visual analogy task that requires complex scene understanding, builds a large dataset with human-validated analogies, and evaluates current models on it, highlighting the challenges of analogical reasoning.
Contribution
It presents a novel visual analogy dataset built from situation recognition annotations with the help of CLIP and crowdsourcing, and benchmarks model performance on complex scene-based analogies.
Findings
Humans agree with the dataset label ~80% of the time (chance level 25%).
Models achieve ~86% accuracy on random distractors, but only ~53% on carefully chosen distractors.
State-of-the-art models struggle with complex analogies compared to humans.
Abstract
A core process in human cognition is analogical mapping: the ability to identify a similar relational structure between different situations. We introduce a novel task, Visual Analogies of Situation Recognition, adapting the classical word-analogy task into the visual domain. Given a triplet of images, the task is to select an image candidate B' that completes the analogy (A to A' is like B to what?). Unlike previous work on visual analogy that focused on simple image transformations, we tackle complex analogies requiring understanding of scenes. We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies. Crowdsourced annotations for a sample of the data indicate that humans agree with the dataset label ~80% of the time (chance level 25%). Furthermore, we use human annotations to create a gold-standard dataset of 3,820…
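The task format described in the abstract (A to A' is like B to what?) can be illustrated with a minimal sketch that borrows the vector-arithmetic trick from classical word analogies. This is an assumption made for illustration, not necessarily the method evaluated in the paper: the function name `solve_analogy` and the toy 3-dimensional embeddings are hypothetical, and in practice the vectors would come from an image encoder such as CLIP.

```python
import numpy as np

def solve_analogy(a, a_prime, b, candidates):
    """Pick the candidate whose embedding is closest (by cosine similarity)
    to b + (a_prime - a). Assumes no vector has zero norm."""
    target = b + (a_prime - a)
    target = target / np.linalg.norm(target)
    cands = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmax(cands @ target))

# Hypothetical hand-built embeddings; real ones would come from an image encoder.
a = np.array([1.0, 0.0, 0.0])        # image A
a_prime = np.array([1.0, 1.0, 0.0])  # image A' (A with one element changed)
b = np.array([0.0, 0.0, 1.0])        # image B

# Four candidates, matching the 25% chance level mentioned in the abstract.
candidates = np.array([
    [0.0, 0.0, 1.0],  # distractor: copy of B
    [0.0, 1.0, 1.0],  # B with the same change applied
    [1.0, 0.0, 0.0],  # distractor: copy of A
    [1.0, 1.0, 0.0],  # distractor: copy of A'
])

choice = solve_analogy(a, a_prime, b, candidates)
print(choice)  # index of the selected candidate
```

With these toy vectors the second candidate (index 1) is selected, because it is the only one that applies the A to A' change to B. Distractors that resemble the inputs, like the other three here, are the kind of case where a similarity-based solver can go wrong on real images.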
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition
Methods: Contrastive Language-Image Pre-training
