InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Yulu Gan; Sungwoo Park; Alexander Schubert; Anthony Philippakis; Ahmed M. Alaa

arXiv:2310.00390 · cs.CV · March 19, 2024

TL;DR

InstructCV is a unified, instruction-tuned text-to-image diffusion model that can perform multiple computer vision tasks from natural language instructions, demonstrating strong generalization and competitive performance.

Contribution

This work introduces a novel approach to unify diverse vision tasks as text-to-image generation problems using instruction tuning on a diffusion model.

Findings

1. Performs competitively with task-specific models
2. Generalizes well to unseen data and instructions
3. Uses a multi-task dataset created from existing vision datasets

Abstract

Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool…
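The casting described in the abstract, where the text is a task instruction and the output image visually encodes the task result, can be illustrated with a small sketch. This is not the authors' code: the function names, the instruction wording, and the specific encodings (a white-on-black mask for segmentation, a min-max normalized grayscale image for depth) are assumptions chosen only to show how a label from an existing vision dataset might be turned into an (instruction, input image, target image) training triple.

```python
import numpy as np

# Hypothetical helpers: names, prompts, and encodings are illustrative
# assumptions, not taken from the InstructCV repository.

def encode_segmentation_target(mask, color=(255, 255, 255)):
    """Render a binary mask of shape (H, W) as an RGB image of shape (H, W, 3)."""
    target = np.zeros(mask.shape + (3,), dtype=np.uint8)
    target[mask.astype(bool)] = color  # paint foreground pixels
    return target

def encode_depth_target(depth):
    """Min-max normalize a depth map to [0, 255] and replicate it to 3 channels."""
    d = depth.astype(np.float64)
    span = d.max() - d.min()
    norm = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    gray = np.round(norm * 255).astype(np.uint8)
    return np.stack([gray, gray, gray], axis=-1)

def make_example(task, image, label, category=None):
    """Build one (instruction, input image, target image) training triple."""
    if task == "segmentation":
        instruction = f"Segment the {category}."
        target = encode_segmentation_target(label)
    elif task == "depth":
        instruction = "Estimate the depth map of this image."
        target = encode_depth_target(label)
    else:
        raise ValueError(f"unsupported task: {task}")
    return {"instruction": instruction, "input": image, "target": target}

# Usage with toy data: a 2x2 RGB image, a binary mask, and a depth map.
image = np.zeros((2, 2, 3), dtype=np.uint8)
seg = make_example("segmentation", image, np.array([[1, 0], [0, 1]]), category="cat")
dep = make_example("depth", image, np.array([[0.0, 1.0], [2.0, 4.0]]))
```

Under these assumptions every task shares the same input and output types (text plus image in, image out), which is what would let a single instruction-conditioned image generator be trained on examples pooled from several datasets.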

Code & Models

Repositories

AlaaLab/InstructCV (PyTorch, official)

Videos

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Diffusion