
TL;DR
This paper introduces ISRG, an image-to-text model that simplifies scene graph generation by decoupling object segmentation from relation prediction. It reports 31 points on the OpenPSG dataset, ahead of the ResNet-50 and CLIP baselines.
Contribution
The paper proposes a novel two-step approach for scene graph generation, reducing annotation costs and improving accuracy over existing methods.
Findings
Achieved 31 points on the OpenPSG dataset.
Outperformed ResNet-50 baseline by 16 points.
Outperformed CLIP baseline by 5 points.
Abstract
Scene graphs provide structured semantic understanding beyond images. For downstream tasks such as image retrieval, visual question answering, visual relationship detection, and even autonomous vehicle technology, scene graphs can not only distil complex image information but also correct the bias of visual models using semantic-level relations, which gives them broad application prospects. However, the heavy labour cost of constructing graph annotations may hinder the application of PSG in practical scenarios. Inspired by the observation that people usually identify the subject and object first and then determine the relationship between them, we propose to decouple the scene graph generation task into two sub-tasks: (1) an image segmentation task to pick out the qualified objects, and (2) a restricted auto-regressive text generation task to generate the relation between the given objects. Therefore,…
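The two sub-tasks described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the segmentation step is replaced by a plain list of object labels, the predicate vocabulary `RELATIONS` is hypothetical, and the scoring function stands in for whatever model would score the next token. The only thing it shows is one common way to restrict auto-regressive generation, namely allowing at each step only the tokens that continue a valid relation phrase.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"  # end-of-relation marker (illustrative)

# Hypothetical predicate vocabulary; the source does not list the real one.
RELATIONS = ["standing on", "sitting on", "holding", "looking at"]

Scorer = Callable[[List[str], str], float]


def build_trie(phrases: List[str]) -> Dict:
    """Nested-dict trie over whitespace tokens; every phrase ends in EOS."""
    trie: Dict = {}
    for phrase in phrases:
        node = trie
        for tok in phrase.split():
            node = node.setdefault(tok, {})
        node[EOS] = {}
    return trie


def constrained_decode(trie: Dict, score: Scorer) -> str:
    """Greedy decoding where each step may only pick a token the trie allows."""
    prefix: List[str] = []
    node = trie
    while True:
        tok = max(node, key=lambda t: score(prefix, t))
        if tok == EOS:
            return " ".join(prefix)
        prefix.append(tok)
        node = node[tok]


def scene_graph(objects: List[str], scorer_for_pair) -> List[Tuple[str, str, str]]:
    """Step 2 only: generate one relation per ordered (subject, object) pair.

    `objects` stands in for the output of the segmentation step.
    """
    trie = build_trie(RELATIONS)
    return [(s, constrained_decode(trie, scorer_for_pair(s, o)), o)
            for s, o in permutations(objects, 2)]


def toy_scorer(subj: str, obj: str) -> Scorer:
    """Stand-in for a learned model: fixed token weights, ignoring the pair."""
    weights = {"on": 5.0, "holding": 1.0}
    return lambda prefix, tok: weights.get(tok, 0.0)


print(scene_graph(["person", "cup"], toy_scorer))
```

In this toy run the scorer rates "on" highest, but "on" cannot start any phrase in the vocabulary, so the decoder falls back to the best allowed token and every output is a valid relation.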
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
