SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning
Xu Zhang, Jin Yuan, Hanwang Zhang, Guojin Zhong, Yongsheng Zang, Jiacheng Lin, Zhiyong Li

TL;DR
SGDiff is a scene graph guided diffusion model for joint image segmentation and captioning. Given a minimal prompt such as a bounding box, it produces diverse (caption, masks) interpretations with improved caption-mask alignment.
Contribution
The paper proposes a new task of Image Collaborative Segmentation and Captioning and develops a scene graph guided diffusion model with a prompt adaptor and contrastive learning for accurate, aligned caption-mask predictions.
Findings
SGDiff outperforms existing methods on benchmark datasets.
The model effectively captures user intent from minimal prompts.
Results show high-quality, diverse caption and segmentation outputs.
Abstract
Controllable image semantic understanding tasks, such as captioning or segmentation, require users to input a prompt (e.g., text or bounding boxes) to predict a unique outcome, which leads to problems such as high-cost prompt input or limited information output. This paper introduces a new task, "Image Collaborative Segmentation and Captioning" (SegCaptioning), which aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs, allowing flexible result selection by users. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks. Technically, we propose a novel Scene Graph Guided Diffusion Model that leverages structured scene graph features for…
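To make the task's input and output concrete, the following is a minimal Python sketch of the data involved: one bounding-box prompt mapped to several candidate (caption, masks) interpretations. It is not the authors' code. Every name here (BoundingBox, GroundedWord, Interpretation, box_to_mask) is hypothetical, the captions are toy values, and the masks are simply rasterized boxes standing in for predicted segmentation masks.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All names below are hypothetical, chosen for illustration only.
BoundingBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Mask = List[List[int]]                   # binary mask, one inner list per image row


@dataclass
class GroundedWord:
    word: str   # a caption word that refers to an image region
    mask: Mask  # the segmentation mask paired with that word


@dataclass
class Interpretation:
    caption: str
    grounded_words: List[GroundedWord]

    def is_aligned(self) -> bool:
        # Simplified alignment check: every grounded word occurs in the caption.
        tokens = self.caption.lower().split()
        return all(g.word.lower() in tokens for g in self.grounded_words)


def box_to_mask(box: BoundingBox, height: int, width: int) -> Mask:
    """Rasterize a box into a binary mask (toy stand-in for a predicted mask)."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
            for y in range(height)]


# One minimal prompt, several candidate interpretations for the user to pick from.
prompt: BoundingBox = (1, 1, 3, 3)
dog_mask = box_to_mask(prompt, height=4, width=4)
candidates = [
    Interpretation("a dog sits on grass", [GroundedWord("dog", dog_mask)]),
    Interpretation(
        "a dog chases a ball",
        [GroundedWord("dog", dog_mask),
         GroundedWord("ball", box_to_mask((0, 0, 1, 1), height=4, width=4))],
    ),
]

for candidate in candidates:
    print(candidate.caption, "| aligned:", candidate.is_aligned())
```

The sketch only captures the shape of the output: each interpretation pairs a caption with one mask per grounded word, and the user selects among the candidates. How SGDiff actually generates and aligns these pairs is described in the paper, not here.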
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Graph Neural Networks
