VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Xuyang Liu; Siteng Huang; Yachen Kang; Honggang Chen; Donglin Wang

arXiv:2309.01141·cs.CV·January 24, 2024

TL;DR

VGDiffZero demonstrates that pre-trained text-to-image diffusion models can be effectively used for zero-shot visual grounding without additional training, leveraging a novel region-scoring method.

Contribution

The paper introduces VGDiffZero, a zero-shot visual grounding framework that applies pre-trained diffusion models directly to discriminative tasks without fine-tuning.

Findings

01 Achieves strong zero-shot visual grounding performance on RefCOCO datasets.

02 Introduces a comprehensive region-scoring method considering global and local contexts.

03 Demonstrates the effectiveness of generative diffusion models for discriminative visual tasks.

Abstract

Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning or additional training dataset. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves…
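
The abstract describes the region scoring only at a high level, so the following is a rough sketch of one way such scoring could work, not the authors' implementation. It assumes that each proposal is scored by how well a text-conditioned denoiser predicts the noise added to two views of the proposal: a cropped view (local context) and a masked full-canvas view (global context). It also assumes the two errors are simply summed and the lowest total wins. The names `crop_view`, `mask_view`, `noise_error`, `select_box`, `ALPHA_BAR`, and `toy_predict_noise` are all hypothetical; a real system would replace the toy predictor with a pre-trained diffusion model and would typically work in a latent space, which this sketch omits.

```python
import numpy as np

# Hypothetical cumulative noise-schedule value at one fixed timestep (assumption).
ALPHA_BAR = 0.5


def crop_view(image, box):
    """Local context: keep only the pixels inside the proposal box."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]


def mask_view(image, box):
    """Global context: keep the full canvas but zero everything outside the box."""
    x0, y0, x1, y1 = box
    out = np.zeros_like(image)
    out[y0:y1, x0:x1] = image[y0:y1, x0:x1]
    return out


def noise_error(view, text, predict_noise, rng):
    """Add Gaussian noise to a view and measure the denoiser's prediction error."""
    eps = rng.standard_normal(view.shape)
    noisy = np.sqrt(ALPHA_BAR) * view + np.sqrt(1.0 - ALPHA_BAR) * eps
    pred = predict_noise(noisy, text)
    return float(np.mean((pred - eps) ** 2))


def select_box(image, boxes, text, predict_noise, seed=0):
    """Score every proposal (local + global error) and return the argmin."""
    rng = np.random.default_rng(seed)
    scores = [
        noise_error(crop_view(image, b), text, predict_noise, rng)
        + noise_error(mask_view(image, b), text, predict_noise, rng)
        for b in boxes
    ]
    return int(np.argmin(scores)), scores


def toy_predict_noise(noisy, text):
    """Stand-in for a text-conditioned denoiser (NOT a real diffusion model).

    It assumes the clean view is a constant intensity implied by the text,
    so its error is small only when the view actually matches that intensity.
    """
    target = 1.0 if text == "bright" else 0.0
    return (noisy - np.sqrt(ALPHA_BAR) * target) / np.sqrt(1.0 - ALPHA_BAR)


if __name__ == "__main__":
    image = np.zeros((8, 8, 3))
    image[0:4, 0:4] = 1.0  # a bright patch in the top-left corner
    boxes = [(0, 0, 4, 4), (4, 4, 8, 8)]  # (x0, y0, x1, y1) proposals
    best, scores = select_box(image, boxes, "bright", toy_predict_noise)
    print(best, scores)
```

With the toy predictor, the query "bright" selects the top-left box because both of its views are closer to the intensity the text implies. The same selection logic would apply with a real denoiser, under the assumptions stated above.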

Code & Models

Repositories

xuyang-liu16/vgdiffzero
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Diffusion