Harnessing Diffusion Models for Visual Perception with Meta Prompts
Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang

TL;DR
This paper introduces a novel method to adapt pre-trained diffusion models for visual perception tasks by using learnable meta prompts and a recurrent refinement strategy, achieving state-of-the-art results across multiple benchmarks.
Contribution
The paper proposes the use of learnable meta prompts and a recurrent refinement training strategy to effectively repurpose diffusion models for various visual perception tasks.
Findings
Achieves new records in depth estimation on NYU Depth V2 and KITTI.
Attains top performance in semantic segmentation on Cityscapes.
Performs comparably to the state of the art in semantic segmentation on ADE20K and pose estimation on COCO.
Abstract
The issue of generative pretraining for vision models has persisted as a long-standing conundrum. At present, the text-to-image (T2I) diffusion model demonstrates remarkable proficiency in generating high-definition images matching textual inputs, a feat made possible through its pre-training on large-scale image-text pairs. This leads to a natural inquiry: can diffusion models be utilized to tackle visual perception tasks? In this paper, we propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. Our key insight is to introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. The effect of meta prompts is two-fold. First, as a direct replacement of the text embeddings in the T2I models, they can activate task-relevant features during feature extraction. Second, they will be used to…
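The first role of meta prompts described above, standing in for the text embeddings that condition the T2I model, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cross_attention`, `init_meta_prompts`), the sizes (4 image tokens, 3 prompts, dimension 8), and the omission of learned query/key/value projections are all assumptions made for illustration. It only shows that a small table of free embeddings can serve as the conditioning sequence in a cross-attention step, where a text encoder's output would normally go.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention, no learned projections.

    queries: N vectors of size d (stand-in for flattened image features).
    context: K vectors of size d (stand-in for the conditioning sequence).
    Returns (outputs, weights): N vectors of size d and N rows of K weights.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = [
        softmax([scale * sum(qi * ci for qi, ci in zip(q, c)) for c in context])
        for q in queries
    ]
    outputs = [
        [sum(w * c[j] for w, c in zip(row, context)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


def init_meta_prompts(num_prompts, dim, seed=0):
    """Hypothetical prompt table; in training these would be learnable parameters."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]


# Toy sizes (assumed): 4 image tokens, 3 meta prompts, embedding dimension 8.
rng = random.Random(1)
image_feats = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
meta_prompts = init_meta_prompts(num_prompts=3, dim=8)

# The meta prompts take the place of the text embeddings as the context.
outputs, weights = cross_attention(image_feats, meta_prompts)
```

Each row of `weights` is a distribution over the prompts for one image token, so the per-prompt attention maps indicate which image regions each prompt responds to. In an actual diffusion U-Net the queries, keys, and values would pass through learned projections, and the prompt table would be optimised together with the task head.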
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Diffusion
