Learning to Generate Scene Graph from Natural Language Supervision
Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, Yin Li

TL;DR
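This paper introduces a method for learning scene graph generation from natural language supervision. An off-the-shelf object detector localizes objects, concepts parsed from captions are matched to the detected regions to form pseudo labels, and a Transformer-based model is trained to predict those labels. The approach also enables open-vocabulary scene graph generation.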
Contribution
It presents one of the first approaches to learn scene graph generation from image-sentence pairs without relying on human-annotated scene graphs, reporting a 30% relative gain over a recent baseline.
Findings
30% relative gain over a recent method when learning from image-sentence pairs only
Effective in both weakly and fully supervised scene graph generation settings
First open-set scene graph generation results
Abstract
Learning from image-text data has demonstrated recent success for many recognition tasks, yet is currently limited to visual features or individual visual concepts such as objects. In this paper, we propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph. To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs. Further, we design a Transformer-based model to predict these "pseudo" labels via a masked token prediction task. Learning from only image-sentence pairs, our model achieves 30% relative gain over a recent method trained with human-annotated…
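The label-matching step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the caption has already been parsed into (subject, predicate, object) triplets, that the detector returns a category name per region, and that a caption concept matches a region when the lowercased names are identical. The names `Detection` and `make_pseudo_labels` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Detection:
    label: str   # category name predicted by the detector
    box: tuple   # (x1, y1, x2, y2) in pixels


def make_pseudo_labels(detections, caption_triplets):
    """Ground caption triplets onto detected regions by label matching.

    Assumption (for illustration only): a caption concept matches a region
    when their lowercased names are identical. Every matching pair of
    distinct regions receives the caption's predicate as a pseudo label.
    """
    # Index region ids by their lowercased detector label.
    by_label = {}
    for idx, det in enumerate(detections):
        by_label.setdefault(det.label.lower(), []).append(idx)

    pseudo = []
    for subj, pred, obj in caption_triplets:
        subj_ids = by_label.get(subj.lower(), [])
        obj_ids = by_label.get(obj.lower(), [])
        for s, o in product(subj_ids, obj_ids):
            if s != o:  # a region is never related to itself
                pseudo.append((s, pred, o))
    return pseudo


detections = [
    Detection("man", (10, 20, 120, 300)),
    Detection("horse", (90, 80, 400, 320)),
    Detection("hat", (30, 10, 80, 50)),
]
# Triplets as a caption parser might produce them (hypothetical input).
caption_triplets = [("man", "riding", "horse"), ("man", "wearing", "helmet")]

pseudo_labels = make_pseudo_labels(detections, caption_triplets)
print(pseudo_labels)  # [(0, 'riding', 1)]
```

The second triplet is dropped because no detected region is labelled "helmet". Under exact string matching, any caption concept outside the detector's vocabulary yields no pseudo label, so a real system would likely need a softer matching rule.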
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
