Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Zi-Yi Dou; Aishwarya Kamath; Zhe Gan; Pengchuan Zhang; Jianfeng Wang; Linjie Li; Zicheng Liu; Ce Liu; Yann LeCun; Nanyun Peng; Jianfeng Gao; Lijuan Wang

arXiv:2206.07643·cs.CV·November 21, 2022·67 cites

TL;DR

FIBER is a vision-language model that performs multimodal fusion inside the image and text backbones through inserted cross-attention, and it is pre-trained in two stages, coarse-grained then fine-grained. It is reported to perform strongly across diverse VL tasks while using pre-training data efficiently.

Contribution

The paper proposes FIBER, a unified VL model with fusion embedded in the backbone and a two-stage pre-training approach, enabling it to excel in both high-level and region-level understanding tasks.

Findings

01. FIBER outperforms strong baselines on multiple VL tasks.

02. Deep fusion in the backbone improves memory efficiency and performance.

03. Two-stage pre-training effectively leverages coarse and fine-grained data.

Abstract

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on…
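The fusion-in-the-backbone idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the scalar `gate`, and the omission of learned projections, multiple heads, normalization, and feed-forward sublayers are all assumptions made to keep the example short. It only shows the structural point that each backbone layer attends to its own tokens and then, through cross-attention, to the other modality's tokens.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, context):
    """Single-head scaled dot-product attention with no learned projections
    (an assumption of this sketch). Each output vector is a weighted average
    of the context vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)])
    return out


def fused_layer(x, other, gate):
    """One hypothetical backbone layer with fusion inside it:
    residual self-attention, then residual cross-attention to the other
    modality scaled by `gate` (gate=0.0 leaves the layer uni-modal)."""
    sa = attention(x, x)
    x = [[a + b for a, b in zip(r, s)] for r, s in zip(x, sa)]
    ca = attention(x, other)
    return [[a + gate * b for a, b in zip(r, c)] for r, c in zip(x, ca)]


random.seed(0)
image_tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 patches
text_tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # 3 words

# Both backbones are updated inside their own layers; there is no separate fusion stack.
image_fused = fused_layer(image_tokens, text_tokens, gate=0.5)
text_fused = fused_layer(text_tokens, image_tokens, gate=0.5)

# With the gate at zero the image branch ignores the text entirely.
image_unimodal = fused_layer(image_tokens, text_tokens, gate=0.0)
```

In this sketch, setting `gate` to zero recovers a purely uni-modal layer, which is one plausible way a fused backbone could also serve tasks that need separate image and text encoders; whether and how FIBER does this is not stated in the abstract above.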

Code & Models

Repositories

microsoft/fiber (PyTorch, official)

Videos

Computer Vision Study Group Session on FIBER (YouTube)

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques