ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis

Advik Sinha; Saurabh Atreya; Aashutosh A V; Sk Aziz Ali; Abhijit Das

arXiv:2511.20274·cs.CV·November 26, 2025

TL;DR

ScenarioCLIP introduces a novel model and dataset for understanding complex real-world scenes by explicitly modeling objects, actions, and relations, improving zero-shot and fine-tuned scene analysis tasks.

Contribution

The paper presents ScenarioCLIP, a model that incorporates relational grounding, together with a new dataset for comprehensive scene understanding that goes beyond class-level discrimination.

Findings

01. ScenarioCLIP outperforms baselines on various scene understanding tasks.

02. The dataset enables effective training and evaluation of relational scene models.

03. ScenarioCLIP demonstrates strong zero-shot capabilities.

Abstract

Until recently, work on CLIP-type foundation models has focused largely on either retrieving short descriptions or classifying objects in a scene, treating the latter as a single-object image classification task. The same holds for the image retrieval task, that is, retrieving the image embedding given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. Recent CLIP-based methods improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning