UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang

TL;DR
UFO is a single unified transformer that handles both unimodal and multimodal vision-language tasks. Multi-task pre-training lets one network replace separate per-modality and fusion networks, simplifying the architecture while reaching state-of-the-art results on visual question answering, COCO image captioning, and nocaps.
Contribution
The paper presents a single transformer framework for all vision-language tasks, reducing complexity and improving performance via multi-task learning during pre-training.
Findings
Achieves new state-of-the-art on visual question answering.
Sets new records on COCO image captioning and nocaps.
Performs competitively on image-text retrieval.
Abstract
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes the image-text contrastive loss, image-text matching loss, and masked language modeling loss based on the bidirectional and the seq2seq attention mask. The same transformer network is used as the image encoder, the text encoder, or the fusion network in different pre-training tasks. Empirically, we observe less conflict among different tasks and…
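The abstract mentions masked language modeling under both a bidirectional and a seq2seq attention mask. The sketch below illustrates how the two masks differ. It is not taken from the paper: it assumes the common UniLM-style convention in which prefix tokens (for example, image tokens) attend only to each other, while target text tokens attend to the whole prefix and to earlier target tokens. The function names and the 1/0 encoding are hypothetical choices made for illustration.

```python
def bidirectional_mask(n_tokens):
    """Return an n x n mask where every token may attend to every token."""
    return [[1] * n_tokens for _ in range(n_tokens)]


def seq2seq_mask(n_prefix, n_target):
    """Return a mask for a prefix followed by a target sequence.

    Assumed convention (UniLM-style, not confirmed by the source):
    - prefix tokens attend to all prefix tokens, never to target tokens;
    - target tokens attend to all prefix tokens and to target tokens
      at or before their own position.
    mask[q][k] == 1 means query position q may attend to key position k.
    """
    n = n_prefix + n_target
    mask = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < n_prefix:
                # Every position can see the prefix.
                mask[q][k] = 1
            elif q >= n_prefix and k <= q:
                # Target positions see themselves and earlier targets.
                mask[q][k] = 1
    return mask


if __name__ == "__main__":
    # Two prefix tokens followed by two target tokens.
    for row in seq2seq_mask(2, 2):
        print(row)
```

Under this convention the same transformer weights can serve understanding tasks (bidirectional mask) and generation tasks such as captioning (seq2seq mask) by swapping only the mask.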
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence
