VASR: Visual Analogies of Situation Recognition

Yonatan Bitton; Ron Yosef; Eli Strugo; Dafna Shahaf; Roy Schwartz; Gabriel Stanovsky

arXiv:2212.04542·cs.CV·December 12, 2022

TL;DR

This paper introduces a visual analogy task that requires complex scene understanding, builds a large dataset of human-validated analogies, and evaluates current models on it, showing that analogical reasoning remains difficult for them.

Contribution

The paper presents a novel visual analogy dataset built from situation recognition annotations and the CLIP model, validated through crowdsourcing, and benchmarks model performance on complex scene-based analogies.

Findings

01

Humans agree with the dataset labels ~80% of the time (chance level: 25%).

02

Models achieve ~86% accuracy on random distractors, but only ~53% on carefully chosen distractors.

03

State-of-the-art models struggle with complex analogies compared to humans.

Abstract

A core process in human cognition is analogical mapping: the ability to identify a similar relational structure between different situations. We introduce a novel task, Visual Analogies of Situation Recognition, adapting the classical word-analogy task into the visual domain. Given a triplet of images, the task is to select an image candidate B' that completes the analogy (A to A' is like B to what?). Unlike previous work on visual analogy that focused on simple image transformations, we tackle complex analogies requiring understanding of scenes. We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies. Crowdsourced annotations for a sample of the data indicate that humans agree with the dataset label ~80% of the time (chance level 25%). Furthermore, we use human annotations to create a gold-standard dataset of 3,820…
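The task format in the abstract (A is to A' as B is to which candidate?) can be illustrated with the vector-arithmetic formulation familiar from word analogies. The sketch below is only an illustration under stated assumptions: the three-dimensional vectors are toy stand-ins for image embeddings, the function names `cosine` and `solve_analogy` are hypothetical, and nothing here is claimed to be the authors' evaluation code. It assumes four candidates, consistent with the 25% chance level quoted above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, a_prime, b, candidates):
    """Return the index of the candidate closest to B + (A' - A).

    Hypothetical helper: applies the change observed from A to A'
    to B, then ranks the candidates by cosine similarity.
    """
    target = [bi + (api - ai) for ai, api, bi in zip(a, a_prime, b)]
    scores = [cosine(target, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy embeddings (assumed values, not taken from the dataset).
a = [1.0, 0.0, 0.0]        # image A
a_prime = [1.0, 1.0, 0.0]  # image A': A with one aspect changed
b = [0.0, 0.0, 1.0]        # image B

candidates = [
    [0.0, 1.0, 1.0],  # B with the same change applied
    [1.0, 0.0, 0.0],  # distractor: identical to A
    [0.0, 0.0, 1.0],  # distractor: B unchanged
    [1.0, 1.0, 0.0],  # distractor: identical to A'
]

best = solve_analogy(a, a_prime, b, candidates)
print(best)
```

With real data, the toy vectors would be replaced by embeddings from an image encoder such as CLIP. How well this simple arithmetic transfers to complex scenes is exactly the kind of question the benchmark is meant to probe.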

Code & Models

Repositories

vasr-dataset/vasr
PyTorch · Official

Datasets

nlphuji/vasr
dataset · 494 downloads

Videos

VASR: Visual Analogies of Situation Recognition · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training