Open World Scene Graph Generation using Vision Language Models

Amartya Dutta; Kazi Sajeed Mehrab; Medha Sawhney; Abhilash Neog; Mridul Khurana; Sepideh Fatemi; Aanish Pradhan; M. Maruf; Ismini Lourentzou; Arka Daw; Anuj Karpatne

arXiv:2506.08189·cs.CV·June 11, 2025



TL;DR

This paper presents an open-world scene graph generation method that leverages pretrained vision-language models in a zero-shot, training-free manner, enabling relational understanding of images with unseen objects and relations.

Contribution

The paper introduces a training-free framework that uses multimodal prompting and embedding alignment to generate scene graphs without dataset-specific fine-tuning.

Findings

01 Effective zero-shot performance on multiple datasets

02 Capable of handling unseen objects and relations

03 Outperforms traditional supervised methods in open-world scenarios

Abstract

Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. Most methods depend on dataset-specific supervision to learn the variety of interactions, which restricts their usefulness in open-world settings involving novel objects and/or relations. Even methods that leverage large Vision Language Models (VLMs) typically require benchmark-specific fine-tuning. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of VLMs to produce scene graphs with zero additional learning. Casting SGG as a zero-shot structured-reasoning problem, our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets. To assess this setting, we formalize an Open-World…
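The abstract names "embedding alignment" as one component but does not spell out the mechanism here. One plausible reading, offered only as an illustrative sketch and not as the authors' implementation, is that free-form relation phrases produced by a VLM are mapped onto a target predicate vocabulary by nearest-neighbor search in an embedding space. In the Python sketch below, the function names (`cosine`, `align_to_vocabulary`), the three-dimensional vectors, and the predicate list are all hypothetical; a real system would obtain embeddings from a text or vision-language encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_to_vocabulary(query_vec, vocab_vecs):
    """Return (label, score) for the vocabulary entry most similar to query_vec.

    Hypothetical helper: stands in for whatever alignment step the paper uses.
    """
    best_label, best_score = None, -1.0
    for label, vec in vocab_vecs.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy, hand-made embeddings (assumption: real ones would come from an encoder).
PREDICATE_VOCAB = {
    "on": [1.0, 0.0, 0.0],
    "holding": [0.0, 1.0, 0.0],
    "near": [0.0, 0.0, 1.0],
}

# Free-form phrases a VLM might emit, with made-up embeddings.
free_form = {
    "sitting on top of": [0.9, 0.1, 0.2],
    "gripping": [0.1, 0.95, 0.0],
}

# Map each free-form phrase to its closest vocabulary predicate.
aligned = {
    phrase: align_to_vocabulary(vec, PREDICATE_VOCAB)[0]
    for phrase, vec in free_form.items()
}
print(aligned)
```

Because the vocabulary is just a dictionary passed in at inference time, swapping in a different object or relation set requires no retraining, which is the property the abstract emphasizes.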


Code & Models

Repositories

shtuplus/pix2grp_cvpr2024 (PyTorch, Official)


Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis