Harnessing Diffusion Models for Visual Perception with Meta Prompts

Qiang Wan; Zilong Huang; Bingyi Kang; Jiashi Feng; Li Zhang

arXiv:2312.14733 · cs.CV · December 25, 2023 · 5 cites

TL;DR

This paper introduces a novel method to adapt pre-trained diffusion models for visual perception tasks by using learnable meta prompts and a recurrent refinement strategy, achieving state-of-the-art results across multiple benchmarks.

Contribution

The paper proposes the use of learnable meta prompts and a recurrent refinement training strategy to effectively repurpose diffusion models for various visual perception tasks.

Findings

01 Achieves new records in depth estimation on NYU Depth V2 and KITTI.

02 Attains top performance in semantic segmentation on Cityscapes.

03 Performs comparably to the state of the art in semantic segmentation on ADE20K and pose estimation on COCO.

Abstract

The issue of generative pretraining for vision models has persisted as a long-standing conundrum. At present, the text-to-image (T2I) diffusion model demonstrates remarkable proficiency in generating high-definition images matching textual inputs, a feat made possible through its pre-training on large-scale image-text pairs. This leads to a natural inquiry: can diffusion models be utilized to tackle visual perception tasks? In this paper, we propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. Our key insight is to introduce learnable embeddings (meta prompts) into the pre-trained diffusion model to extract proper features for perception. The effect of meta prompts is two-fold. First, as a direct replacement of the text embeddings in the T2I models, they can activate task-relevant features during feature extraction. Second, they will be used to…
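
The first role described in the abstract, meta prompts standing in for text embeddings as the conditioning input, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python lists instead of tensors, omits the learned query/key/value projections of a real cross-attention layer, and all names and sizes (`cross_attention`, `NUM_META_PROMPTS`, `DIM`, and so on) are hypothetical. In an actual model the meta prompts would be trainable parameters updated by the task loss; here they are only randomly initialized.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product cross-attention without learned projections.

    queries: N image-token vectors; context: K conditioning vectors.
    Returns N vectors, each a softmax-weighted average of the context.
    """
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, c)) for c in context]
        weights = softmax(scores)
        output.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        )
    return output


rng = random.Random(0)
DIM, NUM_TOKENS, NUM_META_PROMPTS = 8, 16, 4  # hypothetical sizes

# Stand-in for flattened image features inside the denoising network.
image_tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

# Meta prompts: would be learnable; they occupy the slot where text
# embeddings normally enter the cross-attention layer.
meta_prompts = [
    [rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(NUM_META_PROMPTS)
]

features = cross_attention(image_tokens, meta_prompts)
print(len(features), len(features[0]))
```

Because the conditioning sequence is a free set of vectors rather than the output of a text encoder, its length and content are not tied to any caption, which is what lets training shape it toward the perception task.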

Code & Models

Repositories

fudan-zvg/meta-prompts (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Diffusion