SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning

Xin Hu; Ke Qin; Guiduo Duan; Ming Li; Yuan-Fang Li; Tao He

arXiv:2507.05798·cs.CV·July 9, 2025

Open Access

TL;DR

SPADE is a novel framework that enhances open-vocabulary panoptic scene graph generation by incorporating spatial-aware context reasoning and inversion-guided calibration, significantly improving relation prediction accuracy.

Contribution

The paper introduces SPADE, a method that combines diffusion model inversion with spatial-aware graph transformers to improve spatial relation reasoning in panoptic scene graph generation (PSG).

Findings

01 SPADE outperforms existing methods on benchmark datasets.

02 It achieves higher accuracy in spatial relationship prediction.

03 It is effective in both closed- and open-set scenarios.

Abstract

Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing the relative positions of objects, which results in suboptimal relation prediction. Motivated by the ability of the denoising diffusion model's inversion process to preserve the spatial structure of input images, we propose the SPADE (SPatial-Aware Denoising-nEtwork) framework, a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we…
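The abstract's motivation, that diffusion inversion preserves the spatial structure of the input, can be illustrated with a small round-trip experiment. The sketch below is not SPADE's implementation. The abstract does not say which inversion scheme is used, so the code assumes generic deterministic DDIM inversion. `toy_noise_predictor` is a hypothetical stand-in for a trained UNet, and the linear beta schedule, the function names, and the flattened 8-pixel "image" are all assumptions made for illustration.

```python
import math

def make_alphas_cumprod(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for a linear schedule; index 0 is the clean image."""
    alphas = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alphas.append(alphas[-1] * (1.0 - beta))
    return alphas

def toy_noise_predictor(x, t):
    """Hypothetical stand-in for a trained UNet: a weak, smooth function of x."""
    return [0.1 * math.tanh(v) for v in x]

def ddim_step(x, t_from, t_to, alphas, predictor):
    """One deterministic DDIM update from timestep t_from to t_to (either direction)."""
    a_from, a_to = alphas[t_from], alphas[t_to]
    eps = predictor(x, t_from)
    out = []
    for v, e in zip(x, eps):
        # Estimate the clean signal, then re-noise it to the target timestep.
        x0_pred = (v - math.sqrt(1.0 - a_from) * e) / math.sqrt(a_from)
        out.append(math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * e)
    return out

def ddim_invert(x0, alphas, predictor):
    """Walk from the clean input (t = 0) up to the noisiest timestep."""
    x = list(x0)
    for t in range(len(alphas) - 1):
        x = ddim_step(x, t, t + 1, alphas, predictor)
    return x

def ddim_sample(x_T, alphas, predictor):
    """Walk from the noisiest timestep back down to t = 0."""
    x = list(x_T)
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, t, t - 1, alphas, predictor)
    return x

alphas = make_alphas_cumprod(50)
image = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]  # flattened toy "image"
latent = ddim_invert(image, alphas, toy_noise_predictor)
reconstruction = ddim_sample(latent, alphas, toy_noise_predictor)
max_error = max(abs(a - b) for a, b in zip(image, reconstruction))
```

Because every step is deterministic, the inverted latent keeps the bright and dark positions of the input, and sampling from it returns an approximation of the original. The round trip is approximate rather than exact because the noise estimate is evaluated at the current state instead of the next one.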

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion · Spatially-Adaptive Normalization