Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan Wang

TL;DR
FIBER introduces a novel vision-language model that integrates deep multimodal fusion within the backbone and employs a two-stage pre-training strategy, achieving superior performance across diverse VL tasks with efficient data usage.
Contribution
The paper proposes FIBER, a unified VL model with fusion embedded in the backbone and a two-stage pre-training approach, enabling it to excel in both high-level and region-level understanding tasks.
Findings
FIBER outperforms strong baselines on multiple VL tasks.
Deep fusion in the backbone improves memory efficiency and performance.
Two-stage pre-training effectively leverages coarse and fine-grained data.
Abstract
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on…
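The fusion-in-the-backbone idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attention`, `fused_layer`), the single attention head, the shared token dimension for both modalities, and the scalar gate `alpha` are all assumptions made for illustration. The sketch only shows how a cross-attention term can be added inside a backbone layer so that each modality attends to the other, rather than stacking separate fusion layers on top of the uni-modal backbones.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax over the given axis.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(x, y, w_q, w_k, w_v):
    # Queries come from modality x; keys and values come from modality y.
    q = x @ w_q
    k = y @ w_k
    v = y @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def fused_layer(x, y, params, alpha):
    # Hypothetical fused backbone step: a residual cross-attention term
    # scaled by a gate alpha (an assumption of this sketch). With alpha = 0
    # the layer leaves the uni-modal features unchanged.
    return x + alpha * cross_attention(x, y, *params)

rng = np.random.default_rng(0)
dim = 8                                      # shared width, assumed for simplicity
image_tokens = rng.normal(size=(5, dim))     # 5 image patch features
text_tokens = rng.normal(size=(3, dim))      # 3 text token features
image_params = [rng.normal(size=(dim, dim)) for _ in range(3)]
text_params = [rng.normal(size=(dim, dim)) for _ in range(3)]

# Each backbone attends to the other modality inside its own layer.
fused_image = fused_layer(image_tokens, text_tokens, image_params, alpha=0.5)
fused_text = fused_layer(text_tokens, image_tokens, text_params, alpha=0.5)
unfused_image = fused_layer(image_tokens, text_tokens, image_params, alpha=0.0)
```

In this sketch, setting `alpha` to zero recovers the plain uni-modal features, which suggests how one backbone could serve both as a separate image or text encoder and as a fused encoder.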
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
