SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

Zirui Wang; Jiahui Yu; Adams Wei Yu; Zihang Dai; Yulia Tsvetkov; Yuan Cao

arXiv:2108.10904 · cs.CV · May 17, 2022 · 342 cites

TL;DR

SimVLM introduces a minimalist, weakly supervised pretraining framework for vision-language tasks, achieving state-of-the-art results without relying on expensive annotations or complex objectives.

Contribution

It presents a simple, end-to-end pretraining method using large-scale weak supervision with a single language modeling objective, outperforming prior approaches.

Findings

1. Outperforms previous pretraining methods on VQA, NLVR2, SNLI-VE, and image captioning.

2. Achieves strong zero-shot generalization and transfer capabilities.

3. Significantly improves benchmark scores without extra data or task-specific customization.

Abstract

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves…
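The abstract names a single prefix language modeling objective. As a rough illustration only (not the authors' code), the sketch below builds the attention mask that a prefix LM is generally understood to use: every position can see the whole prefix, while positions after the prefix see only themselves and earlier positions, and the loss is computed on the suffix. The function names `prefix_lm_mask` and `loss_positions` are hypothetical, and the idea that the prefix holds the image representation plus a leading piece of text is an assumption drawn from the usual description of this objective rather than from the text above.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical helper: prefix positions are visible to every position
    (bidirectional), while the remaining positions are visible causally.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_prefix = j < prefix_len  # the prefix is visible to everyone
            causal = j <= i             # otherwise only current and earlier positions
            row.append(in_prefix or causal)
        mask.append(row)
    return mask


def loss_positions(seq_len, prefix_len):
    """Indices of the suffix tokens, i.e. the ones the model is trained to predict."""
    return list(range(prefix_len, seq_len))


if __name__ == "__main__":
    # Print a small mask: 1 means "may attend", 0 means "masked out".
    for row in prefix_lm_mask(seq_len=5, prefix_len=2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Setting `prefix_len=0` reduces the mask to an ordinary causal language model, which is one way to see that the prefix variant only adds bidirectional context over the conditioning part of the sequence.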

Code & Models

Models

🤗 facebook/flava-full · model · 20k downloads · ♡ 43

Videos

SimVLM explained | What the paper doesn’t tell you · youtube

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Simple Visual Language Model