TL;DR
DICEPTION is a versatile diffusion-based visual perception model that efficiently handles multiple tasks with minimal training data and computational resources, achieving near state-of-the-art performance.
Contribution
The paper introduces DICEPTION, a generalist diffusion model that re-purposes pre-trained text-to-image diffusion models for diverse perception tasks with low data and computational costs.
Findings
Achieves results on par with SAM-vit-h using only 0.06% of its training data (600K vs. 1B pixel-level annotated images)
Adapts to new tasks with fine-tuning on as few as 50 images
Subtle classifier-free guidance improves depth and normal estimation
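For readers unfamiliar with classifier-free guidance, the standard blending rule can be sketched as follows. This is a generic illustration and not DICEPTION's implementation: the function name `apply_cfg`, the toy prediction vectors, and the scale values are all assumptions made for the example, and the summary does not state which guidance scale the authors use.

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Blend conditional and unconditional model predictions.

    Standard classifier-free guidance rule:
        guided = uncond + scale * (cond - uncond)
    scale = 1.0 returns the conditional prediction unchanged;
    larger values push further along the conditioning direction.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy values standing in for one denoising step's outputs (hypothetical).
cond = [0.5, -1.0, 2.0]
uncond = [0.0, -0.5, 1.0]

no_guidance = apply_cfg(cond, uncond, 1.0)    # same as the conditional prediction
with_guidance = apply_cfg(cond, uncond, 2.0)  # illustrative scale, not from the paper
print(no_guidance, with_guidance)
```

In a real sampler the same rule is applied elementwise to the model's output tensor at every denoising step, with the scale treated as a tunable hyperparameter.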
Abstract
This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's…
Taxonomy
Methods: Diffusion
