Unleashing Text-to-Image Diffusion Models for Visual Perception

Wenliang Zhao; Yongming Rao; Zuyan Liu; Benlin Liu; Jie Zhou; Jiwen Lu

arXiv:2303.02153 · cs.CV · March 6, 2023 · 6 cites

TL;DR

This paper introduces VPD, a framework that uses a pre-trained text-to-image diffusion model as a backbone for visual perception tasks. It reports state-of-the-art results on depth estimation and referring image segmentation, and faster adaptation to downstream tasks than other pre-training methods.

Contribution

The paper proposes using a pre-trained text-to-image diffusion model as a backbone for visual perception: it prompts the model with text, refines the text features, and uses cross-attention maps for guidance.
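
To make the cross-attention idea concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation: it assumes plain scaled dot-product attention with visual features as queries and text token embeddings as keys, and the names `cross_attention_maps` and `make_toy_inputs` are hypothetical. It only shows how one attention map per text token can be derived from a grid of visual features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_maps(visual, text):
    """Return one attention map per text token (illustrative assumption).

    visual: H x W grid of d-dimensional feature vectors (used as queries).
    text:   list of N d-dimensional token embeddings (used as keys).
    Result: N maps, each H x W; at every pixel the N weights sum to 1.
    """
    height, width = len(visual), len(visual[0])
    scale = 1.0 / math.sqrt(len(text[0]))
    maps = [[[0.0] * width for _ in range(height)] for _ in text]
    for i in range(height):
        for j in range(width):
            # Scaled dot product between this pixel's feature and each token.
            scores = [
                scale * sum(q * k for q, k in zip(visual[i][j], token))
                for token in text
            ]
            for n, weight in enumerate(softmax(scores)):
                maps[n][i][j] = weight
    return maps


def make_toy_inputs(height=4, width=4, dim=8, num_tokens=3, seed=0):
    """Random stand-ins for backbone features and text embeddings."""
    rng = random.Random(seed)
    visual = [
        [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(width)]
        for _ in range(height)
    ]
    text = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    return visual, text


if __name__ == "__main__":
    visual, text = make_toy_inputs()
    maps = cross_attention_maps(visual, text)
    print(len(maps), len(maps[0]), len(maps[0][0]))
```

Each resulting map highlights where a given text token is most strongly matched in the image grid. How such maps are consumed as guidance by the task-specific head is not described in this summary, so that step is left out of the sketch.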

Findings

01. Achieves state-of-the-art results on depth estimation and referring image segmentation.

02. Demonstrates faster adaptation to downstream tasks compared to other pre-training methods.

03. Validates effectiveness across multiple visual perception benchmarks.

Abstract

Diffusion models (DMs) have become the new trend of generative models and have demonstrated a powerful ability of conditional synthesis. Among those, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable by customizable prompts. Unlike the unconditional generative models that focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to the vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and aim to study how to take full advantage of the learned knowledge. Specifically, we…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Diffusion · Denoising Autoencoder