Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, and Rongrong Ji

arXiv:2407.05352·cs.CV·July 9, 2024

TL;DR

This paper introduces DiffPNG, a diffusion-model-based framework for phrase-level visual grounding. It decomposes the task into localization, segmentation, and refinement, and is evaluated zero-shot on the PNG dataset.

Contribution

The paper presents a diffusion-model approach to phrase-level grounding that uses the diffusion architecture itself for segmentation in a zero-shot setting.

Findings

01. DiffPNG achieves strong zero-shot performance on the PNG dataset.

02. The framework decomposes grounding into localization, segmentation, and refinement.

03. Refinement with SAM improves segmentation quality.
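As a rough illustration of what a refinement step of this kind can look like, the sketch below picks, from a set of candidate masks, the one that overlaps most with a coarse mask. This is an assumption made for illustration only: this page does not state how DiffPNG combines its masks with SAM, and the names `iou`, `refine_with_candidates`, and `min_iou` are hypothetical. The candidate masks stand in for proposals that a model such as SAM could produce.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def refine_with_candidates(coarse, candidates, min_iou=0.3):
    """Return the candidate mask with the highest IoU against the coarse mask.

    Falls back to the coarse mask when there are no candidates or when the
    best overlap is below `min_iou` (a hypothetical threshold).
    """
    if not candidates:
        return coarse
    scores = [iou(coarse, c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best] if scores[best] >= min_iou else coarse

# Toy example: a blurry coarse mask and two clean candidate masks.
coarse = np.zeros((4, 4), dtype=bool)
coarse[:2, :3] = True            # 6 pixels, spills one column too far
cand_a = np.zeros((4, 4), dtype=bool)
cand_a[:2, :2] = True            # 4 pixels, inside the coarse region
cand_b = np.zeros((4, 4), dtype=bool)
cand_b[2:, 2:] = True            # 4 pixels, disjoint from the coarse region

refined = refine_with_candidates(coarse, [cand_a, cand_b])
```

Here `refined` is `cand_a`, since it shares 4 of the 6 pixels in the union with the coarse mask while `cand_b` shares none.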

Abstract

Recently, diffusion models have increasingly demonstrated their capabilities in vision understanding. By leveraging prompt-based learning to construct sentences, these models have shown proficiency in classification and visual grounding tasks. However, existing approaches primarily showcase their ability to perform sentence-level localization, leaving the potential for leveraging contextual information for phrase-level understanding largely unexplored. In this paper, we utilize Panoptic Narrative Grounding (PNG) as a proxy task to investigate this capability further. PNG aims to segment object instances mentioned by multiple noun phrases within a given narrative text. Specifically, we introduce the DiffPNG framework, a straightforward yet effective approach that fully capitalizes on the diffusion model's architecture for segmentation by decomposing the process into a sequence of localization,…
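To make the phrase-level setting concrete, the sketch below shows one simple way per-token attention maps could be turned into one mask per noun phrase: average the maps of the tokens in each phrase, normalize, and threshold. It is a minimal illustration under stated assumptions, not the DiffPNG implementation. The attention tensor is synthetic, and `phrase_heatmaps`, `threshold_masks`, the token spans, and the threshold `tau` are all hypothetical.

```python
import numpy as np

def phrase_heatmaps(attn, phrase_spans):
    """Average per-token attention maps over each noun phrase.

    attn: array of shape (T, H, W), one spatial map per text token.
    phrase_spans: dict mapping a phrase to (start, end) token indices,
                  with `end` exclusive.
    Returns a dict of (H, W) maps, each min-max normalized to [0, 1].
    """
    maps = {}
    for phrase, (start, end) in phrase_spans.items():
        m = attn[start:end].mean(axis=0)
        lo, hi = m.min(), m.max()
        maps[phrase] = (m - lo) / (hi - lo + 1e-8)
    return maps

def threshold_masks(heatmaps, tau=0.5):
    """Binarize each heatmap with a fixed (hypothetical) threshold."""
    return {phrase: m >= tau for phrase, m in heatmaps.items()}

# Synthetic attention for a 4-token narrative on a 4x4 grid.
attn = np.zeros((4, 4, 4))
attn[1, :2, :2] = 1.0    # token 1 responds to the top-left region
attn[2, :2, :2] = 0.8    # token 2 responds to the same region
attn[3, 2:, 2:] = 1.0    # token 3 responds to the bottom-right region

# Two noun phrases in the same narrative, each covering its own tokens.
spans = {"a brown dog": (1, 3), "the grass": (3, 4)}
masks = threshold_masks(phrase_heatmaps(attn, spans))
```

Each phrase gets its own mask from the same narrative, which is the difference from sentence-level localization, where the whole sentence yields a single region.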

Code & Models

Repositories

nini0919/diffpng (PyTorch, official)

Taxonomy

Topics: Advanced Text Analysis Techniques

Methods: Segment Anything Model · Diffusion