TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan

TL;DR
This paper introduces TextPSG, a framework that learns to generate panoptic scene graphs using only textual descriptions as supervision, replacing dense pixel-wise annotations with web image-caption data.
Contribution
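It poses the problem of panoptic scene graph generation from purely textual descriptions (Caption-to-PSG) and proposes TextPSG to address its three constraints: no location priors, no explicit links between visual regions and textual entities, and no predefined concept sets. The authors report significant improvements over baseline methods.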
Findings
Outperforms baseline methods significantly.
Achieves strong robustness to out-of-distribution data.
Effective in generating detailed scene graphs from text.
Abstract
Panoptic Scene Graph has recently been proposed for comprehensive scene understanding. However, previous works adopt a fully-supervised learning manner, requiring large amounts of pixel-wise densely-annotated data, which is always tedious and expensive to obtain. To address this limitation, we study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG). The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs. The problem is very challenging for three constraints: 1) no location priors; 2) no explicit links between visual regions and textual entities; and 3) no pre-defined concept sets. To tackle this problem, we propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel…
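To make the target output concrete, here is a minimal sketch of what a panoptic scene graph could look like as a data structure: non-overlapping segments with free-form labels, plus subject-predicate-object relations between them. This is an illustration only, not the authors' code; the class names, the pixel-set mask encoding, and the example labels are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One panoptic segment: a free-form label plus the pixels it covers."""
    label: str
    pixels: frozenset  # set of (row, col) tuples; hypothetical mask encoding


@dataclass
class Relation:
    """A directed edge between two segments, referenced by index."""
    subject: int
    predicate: str
    obj: int


@dataclass
class PanopticSceneGraph:
    segments: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def add_segment(self, label, pixels):
        """Add a segment and return its index; reject overlapping masks."""
        pixels = frozenset(pixels)
        for seg in self.segments:
            # Panoptic constraint: each pixel belongs to at most one segment.
            if seg.pixels & pixels:
                raise ValueError("panoptic segments must not overlap")
        self.segments.append(Segment(label, pixels))
        return len(self.segments) - 1

    def add_relation(self, subject, predicate, obj):
        """Add a (subject, predicate, object) edge between existing segments."""
        n = len(self.segments)
        if not (0 <= subject < n and 0 <= obj < n):
            raise IndexError("relation refers to an unknown segment")
        self.relations.append(Relation(subject, predicate, obj))

    def triples(self):
        """Return the relations as readable (label, predicate, label) tuples."""
        return [
            (self.segments[r.subject].label, r.predicate, self.segments[r.obj].label)
            for r in self.relations
        ]


# Usage: a toy 2x2 image with two segments and one relation.
graph = PanopticSceneGraph()
person = graph.add_segment("person", [(0, 0), (0, 1)])
horse = graph.add_segment("horse", [(1, 0), (1, 1)])
graph.add_relation(person, "riding", horse)
```

Labels and predicates are kept as plain strings here because the problem setting assumes no predefined concept sets.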
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
