OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Peng Wang; An Yang; Rui Men; Junyang Lin; Shuai Bai; Zhikang Li; Jianxin Ma; Chang Zhou; Jingren Zhou; Hongxia Yang

arXiv:2202.03052·cs.CV·June 2, 2022·258 cites

TL;DR

OFA introduces a simple, unified sequence-to-sequence framework for multimodal pretraining that supports diverse tasks and modalities without task-specific customization, achieving state-of-the-art results with relatively small data.

Contribution

It proposes OFA, a task-agnostic, modality-agnostic model that unifies multiple cross-modal and unimodal tasks in a single sequence-to-sequence framework, requiring no extra task-specific layers.

Findings

1. Achieves new SOTA in various cross-modal tasks.
2. Performs competitively on unimodal tasks.
3. Effectively transfers to unseen tasks and domains.

Abstract

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a…
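To make the instruction-based, sequence-to-sequence formulation more concrete, here is a minimal sketch of how several tasks could share one input/output shape. Everything in it is an assumption for illustration: the `Seq2SeqExample` type, the `make_example` helper, the prompt templates, and the textual region encoding are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Seq2SeqExample:
    """One example: every task is reduced to the same three fields."""
    instruction: str           # natural-language task description (source text)
    image_path: Optional[str]  # None for text-only (unimodal) tasks
    target: str                # output sequence the decoder should produce


# Hypothetical prompt templates, for illustration only (not the paper's wording).
TEMPLATES = {
    "image_captioning": "Describe the image.",
    "visual_grounding": 'Locate the region described by: "{text}"',
    "image_classification": "Name the main object in the image.",
    "language_modeling": 'Continue the text: "{text}"',
}


def make_example(task: str, target: str,
                 image_path: Optional[str] = None,
                 text: str = "") -> Seq2SeqExample:
    """Render a task as an (instruction, image, target) triple."""
    if task not in TEMPLATES:
        raise KeyError(f"unknown task: {task}")
    instruction = TEMPLATES[task].format(text=text)
    return Seq2SeqExample(instruction, image_path, target)


if __name__ == "__main__":
    examples = [
        make_example("image_captioning", "a dog on a beach", image_path="dog.jpg"),
        # The region is written as plain text here; this encoding is a placeholder.
        make_example("visual_grounding", "x0=10 y0=20 x1=110 y1=220",
                     image_path="dog.jpg", text="the dog"),
        make_example("language_modeling", "jumps over the lazy dog",
                     text="The quick brown fox"),
    ]
    for ex in examples:
        print(ex)
```

Because each example has the same three fields, a single encoder-decoder could in principle train on a mixed batch of these tasks with one sequence loss, which is the sense in which no task-specific output layer is needed in this sketch.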

Code & Models

Models

sakana-yu/ofa_ocr-recognition_general_base_zh (Hugging Face model · 2 downloads · 1 like)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · OFA · MoCo v3 · Average Pooling · Global Average Pooling · Max Pooling