Scene Graph Generation with Role-Playing Large Language Models
Guikun Chen, Jin Li, Wenguan Wang

TL;DR
This paper introduces SDSGG, a scene-specific scene graph generation framework that uses role-playing large language models to adapt text classifiers based on scene content, significantly improving relation recognition accuracy.
Contribution
The work proposes a novel scene-specific open-vocabulary scene graph generation (OVSGG) framework with adaptive text classifiers generated by role-playing LLMs and a mutual visual adapter for better relation modeling.
Findings
SDSGG outperforms existing methods on benchmark datasets.
Adaptive scene-specific classifiers improve relation detection.
Role-playing LLMs enhance descriptive feature analysis.
Abstract
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP and follow a standard zero-shot pipeline -- computing similarity between the query image and the text embeddings for each category (i.e., text classifiers). In this work, we argue that the text classifiers adopted by existing OVSGG methods, i.e., category-/part-level prompts, are scene-agnostic as they remain unchanged across contexts. Using such fixed text classifiers not only struggles to model visual relations with high variance, but also falls short in adapting to distinct contexts. To plug these intrinsic shortcomings, we devise SDSGG, a scene-specific description based OVSGG framework where the weights of text classifiers are adaptively adjusted according to the visual content. In particular, to generate comprehensive and diverse descriptions oriented to the scene, an LLM…
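The contrast the abstract draws, between fixed text classifiers and classifiers whose weights depend on the visual content, can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are toy 2-D vectors rather than CLIP outputs, the function names (`fixed_scores`, `adaptive_scores`) are hypothetical, and the per-description weights are supplied by hand, since the truncated abstract does not say how SDSGG computes them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fixed_scores(image_emb, class_descs):
    """Scene-agnostic scoring: every description of a class counts equally."""
    return {
        cls: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for cls, descs in class_descs.items()
    }

def adaptive_scores(image_emb, class_descs, weights):
    """Scene-specific scoring: each description gets its own weight.

    `weights` is an assumption of this sketch; in a real system it would be
    derived from the scene, here it is passed in by the caller.
    """
    scores = {}
    for cls, descs in class_descs.items():
        w = weights[cls]
        total = sum(wi * cosine(image_emb, d) for wi, d in zip(w, descs))
        scores[cls] = total / sum(w)
    return scores

# Toy stand-ins for an image embedding and description embeddings.
image_emb = [1.0, 0.0]
class_descs = {
    "riding": [[1.0, 0.0], [0.0, 1.0]],   # second description is irrelevant to this scene
    "holding": [[0.6, 0.8], [0.6, 0.8]],
}
# Hypothetical scene-dependent weights that switch off the irrelevant description.
weights = {"riding": [1.0, 0.0], "holding": [0.5, 0.5]}

fixed = fixed_scores(image_emb, class_descs)
adaptive = adaptive_scores(image_emb, class_descs, weights)
print(max(fixed, key=fixed.get), max(adaptive, key=adaptive.get))
```

With uniform weights, the irrelevant description drags the "riding" score down, so the fixed classifier picks "holding". Down-weighting that description lets the adaptive classifier pick "riding". The toy numbers are chosen only to make this effect visible.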
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
