UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

Jianfeng Wang; Xiaowei Hu; Zhe Gan; Zhengyuan Yang; Xiyang Dai; Zicheng Liu; Yumao Lu; Lijuan Wang

arXiv:2111.10023·cs.CV·November 22, 2021·28 cites

Open Access

TL;DR

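UFO is a single transformer that processes either unimodal inputs (image or text) or multimodal inputs (image and text together). It is trained with multi-task learning during vision-language pre-training, which simplifies the architecture, and it reports new state-of-the-art results on visual question answering, COCO image captioning, and nocaps, with competitive results on image-text retrieval.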

Contribution

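The paper uses one transformer network as the image encoder, the text encoder, and the fusion network, in place of a separate network per modality and a dedicated fusion network. Multi-task learning during pre-training is what lets the same network serve all three roles.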

Findings

01. Achieves new state-of-the-art on visual question answering.

02. Sets new state-of-the-art on COCO image captioning and nocaps.

03. Performs competitively on image-text retrieval.

Abstract

In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes the image-text contrastive loss, image-text matching loss, and masked language modeling loss based on the bidirectional and the seq2seq attention mask. The same transformer network is used as the image encoder, the text encoder, or the fusion network in different pre-training tasks. Empirically, we observe less conflict among different tasks and…
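
The abstract mentions masked language modeling under both a bidirectional and a seq2seq attention mask. The sketch below illustrates how the two masks could differ for a concatenated [image tokens; text tokens] input. It assumes the common UniLM-style convention (image tokens attend only to image tokens; each text token attends to all image tokens and to text tokens up to and including itself), which the abstract does not spell out. The function names and the 1/0 encoding are hypothetical and are not taken from the paper's code.

```python
# Illustrative sketch only: function names and mask convention are assumptions,
# not the authors' implementation. mask[q][k] == 1 means query position q may
# attend to key position k. Positions 0..num_image-1 are image tokens and the
# remaining positions are text tokens.

def bidirectional_mask(num_image, num_text):
    """Every token may attend to every other token."""
    n = num_image + num_text
    return [[1] * n for _ in range(n)]


def seq2seq_mask(num_image, num_text):
    """Image tokens see only image tokens; text tokens see all image
    tokens plus text tokens at or before their own position."""
    n = num_image + num_text
    mask = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_image:
                # Image keys are visible to every query position.
                mask[q][k] = 1
            elif q >= num_image and k <= q:
                # Text keys are visible only to text queries at or after them.
                mask[q][k] = 1
    return mask


if __name__ == "__main__":
    for row in seq2seq_mask(num_image=2, num_text=3):
        print(row)
```

Under this assumed convention the text block of the seq2seq mask is lower triangular, which suits left-to-right generation such as captioning, while the bidirectional mask lets every token see the full sequence, which suits understanding tasks such as visual question answering.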

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence