Toward a Diffusion-Based Generalist for Dense Vision Tasks

Yue Fan; Yongqin Xian; Xiaohua Zhai; Alexander Kolesnikov; Muhammad Ferjad Naeem; Bernt Schiele; Federico Tombari

arXiv:2407.00503·cs.CV·July 2, 2024

TL;DR

This paper proposes a diffusion-based approach to unify various dense vision tasks as conditional image generation, fine-tuning pre-trained diffusion models in pixel space for improved generalist visual perception.

Contribution

It introduces a novel pixel-space diffusion method and a fine-tuning recipe for adapting pre-trained text-to-image models to dense vision tasks.

Findings

01 Achieves performance competitive with other vision generalists on four types of dense vision tasks

02 Addresses the quantization issue that arises when off-the-shelf latent diffusion models are applied directly

03 Demonstrates the versatility of diffusion models as vision generalists

Abstract

Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and have demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and repurpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for fine-tuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show performance competitive with other vision generalists.
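To make the "image as interface" idea concrete, here is a minimal sketch of how a dense label map might be turned into an RGB target image and recovered from a slightly imperfect generated image. It is an illustration under stated assumptions, not the authors' implementation: the three-color `PALETTE`, the helper names `encode_labels` and `decode_labels`, and the nearest-color decoding rule are all hypothetical choices made for this example.

```python
import numpy as np

# Hypothetical 3-class palette; the colors are illustrative, not from the paper.
PALETTE = np.array([[0, 0, 0], [255, 0, 0], [0, 255, 0]], dtype=np.uint8)


def encode_labels(labels: np.ndarray) -> np.ndarray:
    """Map an (H, W) integer label map to an (H, W, 3) RGB target image."""
    return PALETTE[labels]


def decode_labels(image: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) image back to labels via the nearest palette color."""
    # Broadcast pixels against the palette: result has shape (H, W, K, 3).
    diff = image[..., None, :].astype(np.float32) - PALETTE.astype(np.float32)
    # Squared Euclidean distance to each palette entry: shape (H, W, K).
    dist = (diff ** 2).sum(axis=-1)
    return dist.argmin(axis=-1)


labels = np.array([[0, 1], [2, 1]])
target = encode_labels(labels)

# Simulate a small, uniform pixel error in the generated image.
noisy = np.clip(target.astype(np.int16) + 20, 0, 255).astype(np.uint8)
recovered = decode_labels(noisy)
```

In this toy setting the labels survive a small pixel-level perturbation because decoding snaps each pixel to its nearest palette color. How predictions are actually encoded and decoded for each task is specified in the paper itself, not here.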

Taxonomy

Topics: CCD and CMOS Imaging Sensors

Methods: Diffusion