TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Chengyang Zhao; Yikang Shen; Zhenfang Chen; Mingyu Ding; Chuang Gan

arXiv:2310.07056 · cs.CV · April 11, 2025 · 1 citation

TL;DR

This paper introduces TextPSG, a framework that learns to generate panoptic scene graphs using only textual descriptions as supervision, replacing dense pixel-wise annotations with image-caption data from the web.

Contribution

It proposes a new method for panoptic scene graph generation from text that works under three constraints: no location priors, no explicit links between visual regions and textual entities, and no predefined concept sets. It reports significant performance improvements over baselines.

Findings

01 Outperforms baseline methods significantly.

02 Achieves strong robustness to out-of-distribution data.

03 Effective in generating detailed scene graphs from text.

Abstract

Panoptic Scene Graph has recently been proposed for comprehensive scene understanding. However, previous works adopt a fully-supervised learning manner, requiring large amounts of pixel-wise densely-annotated data, which is always tedious and expensive to obtain. To address this limitation, we study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG). The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs. The problem is very challenging for three constraints: 1) no location priors; 2) no explicit links between visual regions and textual entities; and 3) no pre-defined concept sets. To tackle this problem, we propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel…
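To make the target output concrete, here is a minimal sketch of what a panoptic scene graph could look like as a data structure: entities grounded by non-overlapping segmentation masks, plus subject-predicate-object relations between them. The class names, field names, and the toy 2x3 image are illustrative assumptions, not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entity:
    """One node of the graph (hypothetical structure for illustration)."""
    label: str               # free-form concept, e.g. "person"
    mask: List[List[int]]    # binary mask: 1 where the entity covers a pixel


@dataclass
class Relation:
    """One directed edge: entities[subject] --predicate--> entities[obj]."""
    subject: int
    predicate: str
    obj: int


@dataclass
class PanopticSceneGraph:
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Return relations as readable (subject, predicate, object) labels."""
        return [
            (self.entities[r.subject].label, r.predicate, self.entities[r.obj].label)
            for r in self.relations
        ]

    def masks_are_disjoint(self) -> bool:
        """Panoptic masks should not overlap: each pixel has at most one owner."""
        if not self.entities:
            return True
        height = len(self.entities[0].mask)
        width = len(self.entities[0].mask[0])
        for y in range(height):
            for x in range(width):
                if sum(e.mask[y][x] for e in self.entities) > 1:
                    return False
        return True


# Toy example on a 2x3 "image": a person on the left, a horse on the right.
graph = PanopticSceneGraph(
    entities=[
        Entity("person", [[1, 0, 0], [1, 0, 0]]),
        Entity("horse", [[0, 1, 1], [0, 1, 1]]),
    ],
    relations=[Relation(subject=0, predicate="riding", obj=1)],
)

print(graph.triples())
print(graph.masks_are_disjoint())
```

In the fully supervised setting, both the masks and the relation labels come from human annotation; the Caption-to-PSG problem described above asks for the same kind of output while training on image-caption pairs only.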

Videos

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization