Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Jiasen Lu; Christopher Clark; Rowan Zellers; Roozbeh Mottaghi; Aniruddha Kembhavi

arXiv:2206.08916 · cs.CV · October 6, 2022 · 110 citations

TL;DR

Unified-IO is a transformer-based model that unifies a wide range of vision, language, and multi-modal tasks into a single framework by converting all inputs and outputs into token sequences, enabling multi-task learning across diverse datasets.

Contribution

The paper introduces Unified-IO, a single model that handles a broad range of vision, language, and multi-modal tasks without task-specific fine-tuning. The authors report that it is the first model able to perform all 7 tasks on the GRIT benchmark.

Findings

01. Achieves state-of-the-art results on the GRIT benchmark.

02. Reports strong results across 16 diverse benchmarks without task-specific fine-tuning.

03. Handles all 7 tasks on the GRIT benchmark without fine-tuning.

Abstract

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language…
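The idea of homogenizing a structured output into discrete vocabulary tokens can be illustrated with a bounding box. The following is a minimal sketch, not the authors' implementation: the bin count `NUM_BINS`, the `<loc_i>` token format, and the helper names `box_to_tokens`, `tokens_to_box`, and `encode_detection` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed number of coordinate bins; chosen for illustration only.
NUM_BINS = 1000

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def box_to_tokens(box: Box, image_width: int, image_height: int,
                  num_bins: int = NUM_BINS) -> List[str]:
    """Quantize pixel coordinates into hypothetical location tokens."""
    x0, y0, x1, y1 = box
    normalized = (x0 / image_width, y0 / image_height,
                  x1 / image_width, y1 / image_height)
    tokens = []
    for value in normalized:
        value = min(max(value, 0.0), 1.0)                  # clamp to [0, 1]
        index = min(int(value * num_bins), num_bins - 1)   # bin index
        tokens.append(f"<loc_{index}>")
    return tokens


def tokens_to_box(tokens: List[str], image_width: int, image_height: int,
                  num_bins: int = NUM_BINS) -> Box:
    """Map location tokens back to pixel coordinates (bin centers)."""
    indices = [int(token[len("<loc_"):-1]) for token in tokens]
    centers = [(index + 0.5) / num_bins for index in indices]
    return (centers[0] * image_width, centers[1] * image_height,
            centers[2] * image_width, centers[3] * image_height)


def encode_detection(label: str, box: Box, image_width: int,
                     image_height: int) -> List[str]:
    """Serialize one detection as location tokens followed by label words."""
    return box_to_tokens(box, image_width, image_height) + label.lower().split()


if __name__ == "__main__":
    sequence = encode_detection("Traffic light", (0, 0, 320, 240), 640, 480)
    print(sequence)
```

Because the box and its label end up in one flat token sequence, a sequence-to-sequence transformer can emit them with the same decoder it uses for text. Decoding returns the bin center, so the recovered coordinate differs from the original by at most one bin width.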

Videos

[ML News] Text-to-Image models are taking over! (Imagen, DALL-E 2, Midjourney, CogView 2 & more) · youtube

UNIFIED-IO: A Unified Model for Vision, Language, and Multi-modal Tasks · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition