From Data to Modeling: Fully Open-vocabulary Scene Graph Generation
Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen

TL;DR
This paper introduces OvSGTR, a transformer-based framework for fully open-vocabulary scene graph generation that predicts objects and relationships beyond fixed categories, enabling more flexible and comprehensive visual scene understanding.
Contribution
The paper proposes a novel transformer architecture with relation-aware pre-training and a knowledge retention mechanism for open-vocabulary scene graph generation, and reports state-of-the-art results on the VG150 benchmark.
Findings
Achieves state-of-the-art results on the VG150 benchmark.
Effectively handles open-vocabulary object and relation recognition.
Demonstrates robustness across multiple open-vocabulary scenarios.
Abstract
We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised…
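The idea of predicting nodes and edges beyond predefined categories can be illustrated with a minimal sketch. It assumes that labels are chosen by comparing a visual feature against text embeddings of candidate names; this is a common open-vocabulary recipe and is not claimed to be OvSGTR's actual scoring head. The vectors, the vocabularies, and the helper names `cosine` and `classify_open_vocab` are all hypothetical and made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_open_vocab(feature, vocab_embeddings):
    """Return the vocabulary entry whose text embedding best matches the feature."""
    scores = {name: cosine(feature, emb) for name, emb in vocab_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy text embeddings (assumption: a real system would obtain these from a
# frozen text encoder rather than hard-coding them).
object_vocab = {"person": [1.0, 0.0, 0.0], "horse": [0.0, 1.0, 0.0]}
relation_vocab = {"riding": [0.9, 0.1, 0.0], "next to": [0.0, 0.2, 0.9]}

# Adding a category at inference time only requires a new text embedding;
# no classifier weights tied to a fixed label set need to change.
object_vocab["zebra"] = [0.0, 0.8, 0.6]

node_feature = [0.1, 0.7, 0.6]  # hypothetical decoder output for one object (node)
edge_feature = [0.8, 0.2, 0.1]  # hypothetical feature for a subject-object pair (edge)

node_label, node_score = classify_open_vocab(node_feature, object_vocab)
edge_label, edge_score = classify_open_vocab(edge_feature, relation_vocab)
print(node_label, edge_label)
```

Because both object and relation labels are scored against text embeddings in this sketch, the same function serves nodes and edges, and the label set can grow without retraining a fixed classification layer.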
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Knowledge Distillation
