Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi

TL;DR
Unified-IO is a transformer-based model that unifies a wide range of vision, language, and multi-modal tasks into a single framework by converting all inputs and outputs into token sequences, enabling multi-task learning across diverse datasets.
Contribution
The paper introduces Unified-IO, the first model capable of handling multiple vision, language, and multi-modal tasks simultaneously without task-specific fine-tuning.
Findings
Achieves state-of-the-art results on the GRIT benchmark.
Performs well across 16 diverse benchmarks.
Performs all 7 tasks on the GRIT benchmark without task-specific fine-tuning.
Abstract
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language…
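To make the idea of homogenizing outputs into discrete vocabulary tokens concrete, here is a minimal sketch of how a bounding box could be turned into tokens and back. It is an illustration under stated assumptions, not the authors' implementation: the bin count `NUM_BINS`, the `<loc_i>` token format, and the helper names `box_to_tokens` and `tokens_to_box` are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: number of quantization bins per coordinate (illustrative value).
NUM_BINS = 1000


def box_to_tokens(
    box: Tuple[float, float, float, float],
    image_size: Tuple[int, int],
    num_bins: int = NUM_BINS,
) -> List[str]:
    """Quantize a pixel-space box (x0, y0, x1, y1) into four location tokens."""
    x0, y0, x1, y1 = box
    width, height = image_size
    tokens = []
    for value, extent in ((x0, width), (y0, height), (x1, width), (y1, height)):
        # Normalize to [0, 1], then map to an integer bin index.
        frac = min(max(value / extent, 0.0), 1.0)
        index = min(int(frac * num_bins), num_bins - 1)
        tokens.append(f"<loc_{index}>")  # hypothetical token naming
    return tokens


def tokens_to_box(
    tokens: Sequence[str],
    image_size: Tuple[int, int],
    num_bins: int = NUM_BINS,
) -> Tuple[float, ...]:
    """Decode four location tokens back to approximate pixel coordinates."""
    width, height = image_size
    extents = (width, height, width, height)
    coords = []
    for token, extent in zip(tokens, extents):
        index = int(token[len("<loc_"):-1])
        # Use the bin center as the reconstructed coordinate.
        coords.append((index + 0.5) / num_bins * extent)
    return tuple(coords)


if __name__ == "__main__":
    tokens = box_to_tokens((0, 0, 100, 50), image_size=(200, 100))
    print(tokens)
    print(tokens_to_box(tokens, image_size=(200, 100)))
```

Once a box is a short token sequence like this, it can sit in the same output stream as ordinary text tokens, which is what lets a single sequence-to-sequence transformer handle detection-style and language-style tasks with one decoder. The round trip is lossy by at most one bin width per coordinate.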
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
