InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists
Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, Ahmed M. Alaa

TL;DR
InstructCV is a unified, instruction-tuned text-to-image diffusion model that can perform multiple computer vision tasks from natural language instructions, demonstrating strong generalization and competitive performance.
Contribution
This work introduces a novel approach to unify diverse vision tasks as text-to-image generation problems using instruction tuning on a diffusion model.
Findings
Performs competitively with task-specific models
Generalizes well to unseen data and instructions
Uses a multi-task dataset created from existing vision datasets
Abstract
Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool…
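To make the "vision task as text-to-image generation" framing concrete, here is a minimal sketch of how a recognition label might be turned into an (instruction, input image, target image) training triplet. Everything in it is an illustrative assumption rather than the authors' code: the template strings, the `Example` class, the helper names, and the colour choices are hypothetical, and images are plain nested lists so the example stays dependency-free.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]

# Hypothetical instruction templates; the paper's actual prompts may differ.
TEMPLATES = {
    "segmentation": "Segment the {category}.",
    "detection": "Detect the {category}.",
}


@dataclass
class Example:
    instruction: str      # natural-language description of the task
    input_image: Image    # the image the model is conditioned on
    target_image: Image   # the task output, encoded visually


def mask_to_image(mask: List[List[int]], color: Pixel = (255, 255, 255)) -> Image:
    """Encode a binary mask as an image: foreground in `color`, background black."""
    background = (0, 0, 0)
    return [[color if value else background for value in row] for row in mask]


def box_to_image(image: Image, box: Tuple[int, int, int, int],
                 color: Pixel = (255, 0, 0)) -> Image:
    """Encode a bounding box by drawing its outline on a copy of the input image.

    `box` is (x0, y0, x1, y1) with inclusive pixel coordinates.
    """
    x0, y0, x1, y1 = box
    out = [list(row) for row in image]  # copy rows so the input stays untouched
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            if y in (y0, y1) or x in (x0, x1):
                out[y][x] = color
    return out


def make_example(task: str, category: str, image: Image, target: Image) -> Example:
    """Pair an input image with a task instruction and its visual target."""
    return Example(TEMPLATES[task].format(category=category), image, target)


# Usage: one 4x4 image yields two training triplets for two different tasks.
image = [[(10, 10, 10)] * 4 for _ in range(4)]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
seg = make_example("segmentation", "cat", image, mask_to_image(mask))
det = make_example("detection", "cat", image, box_to_image(image, (1, 1, 2, 2)))
```

Under this framing, a single image-conditioned diffusion model can be trained on triplets from many tasks at once, with the instruction text alone selecting which output encoding to produce.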
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Diffusion
