Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, and Rongrong Ji

TL;DR
This paper introduces DiffPNG, a diffusion-model-based framework for phrase-level visual grounding. It decomposes grounding into localization, segmentation, and refinement, and achieves strong zero-shot performance on the PNG dataset.
Contribution
The paper presents a novel diffusion-based approach to phrase-level grounding that uses the diffusion model's architecture for segmentation in a zero-shot setting.
Findings
DiffPNG achieves strong zero-shot performance on the PNG dataset.
The framework decomposes grounding into a sequence of localization, segmentation, and refinement.
Refinement with SAM improves segmentation quality.
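The three-stage decomposition named above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a per-phrase attention map is already available as a 2D array and that a set of candidate masks (for example, produced by SAM) is supplied by the caller. The function names localize, segment, refine, and ground_phrase, the relative threshold, and the IoU-based selection rule are all hypothetical choices made for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def localize(attn):
    """Stage 1 (assumed): anchor point = location of the strongest response."""
    return np.unravel_index(np.argmax(attn), attn.shape)

def segment(attn, rel_thresh=0.5):
    """Stage 2 (assumed): coarse mask by thresholding the min-max normalized map."""
    lo, hi = attn.min(), attn.max()
    norm = (attn - lo) / (hi - lo + 1e-8)
    return norm >= rel_thresh

def refine(coarse, candidates):
    """Stage 3 (assumed): pick the candidate mask that best overlaps the coarse mask."""
    return max(candidates, key=lambda m: iou(coarse, m))

def ground_phrase(attn, candidates):
    """Run the three hypothetical stages for one noun phrase."""
    anchor = localize(attn)
    coarse = segment(attn)
    # Prefer candidates that contain the anchor point; fall back to all of them.
    hits = [m for m in candidates if m[anchor]] or candidates
    return refine(coarse, hits)
```

In use, ground_phrase would be called once per noun phrase in the narrative, each with its own attention map. How those maps are extracted from the diffusion model is outside the scope of this sketch.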
Abstract
Recently, diffusion models have increasingly demonstrated their capabilities in vision understanding. By leveraging prompt-based learning to construct sentences, these models have shown proficiency in classification and visual grounding tasks. However, existing approaches primarily showcase their ability to perform sentence-level localization, leaving the potential for leveraging contextual information for phrase-level understanding largely unexplored. In this paper, we utilize Panoptic Narrative Grounding (PNG) as a proxy task to investigate this capability further. PNG aims to segment object instances mentioned by multiple noun phrases within a given narrative text. Specifically, we introduce the DiffPNG framework, a straightforward yet effective approach that fully capitalizes on the diffusion's architecture for segmentation by decomposing the process into a sequence of localization,…
Taxonomy
Topics: Advanced Text Analysis Techniques
Methods: Segment Anything Model · Diffusion
