Toward a Diffusion-Based Generalist for Dense Vision Tasks
Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari

TL;DR
This paper proposes a diffusion-based approach to unify various dense vision tasks as conditional image generation, fine-tuning pre-trained diffusion models in pixel space for improved generalist visual perception.
Contribution
It introduces a novel pixel-space diffusion method and a fine-tuning recipe for adapting pre-trained text-to-image models to dense vision tasks.
Findings
Achieves performance competitive with other vision generalists on four types of dense vision tasks
Avoids the quantization issue of off-the-shelf latent diffusion models by performing diffusion in pixel space
Demonstrates versatility of diffusion models for vision generalization
Abstract
Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and have demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for fine-tuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show performance competitive with other vision generalists.
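To make the "image as interface" idea concrete, here is a minimal sketch of how a dense label map could be written as an RGB image and read back by snapping each pixel to the nearest known color. The palette, the function names, and the nearest-color decoding rule are all assumptions chosen for illustration; the abstract does not specify the paper's actual encoding. The sketch only shows that reading discrete labels out of a generated image depends on the pixel values being reproduced accurately.

```python
# Hypothetical palette: class id -> RGB color (an assumption, not from the paper).
PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0)}


def encode_labels(label_map):
    """Turn a 2D grid of class ids into a 2D grid of RGB tuples."""
    return [[PALETTE[c] for c in row] for row in label_map]


def decode_pixel(rgb):
    """Snap one RGB value to the class whose palette color is nearest (squared L2)."""
    return min(
        PALETTE,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(rgb, PALETTE[c])),
    )


def decode_image(image):
    """Decode every pixel of a 2D RGB grid back to a class id."""
    return [[decode_pixel(px) for px in row] for row in image]


labels = [[0, 1], [2, 1]]
image = encode_labels(labels)

# A small, uniform pixel error (+20 per channel, clipped to [0, 255]).
noisy = [
    [tuple(min(255, max(0, v + 20)) for v in px) for px in row]
    for row in image
]

recovered_clean = decode_image(image)
recovered_noisy = decode_image(noisy)
```

In this toy setup both `recovered_clean` and `recovered_noisy` equal `labels`, because the error is small relative to the distance between palette colors. A heavily corrupted pixel can land closer to the wrong color: `decode_pixel((120, 140, 0))` returns class 2 even if the pixel was meant to be red (class 1).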
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: Diffusion
