SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao

TL;DR
SimVLM introduces a minimalist, weakly supervised pretraining framework for vision-language tasks, achieving state-of-the-art results without relying on expensive annotations or complex objectives.
Contribution
It presents a simple, end-to-end pretraining method using large-scale weak supervision with a single language modeling objective, outperforming prior approaches.
Findings
Outperforms previous methods on VQA, NLVR2, SNLI-VE, and image captioning.
Achieves strong zero-shot generalization and transfer capabilities.
Significantly improves benchmark scores without extra data or task-specific customization.
Abstract
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves…
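The prefix language modeling objective named in the abstract can be illustrated with a small sketch. It assumes the usual PrefixLM formulation: positions inside the prefix attend to each other bidirectionally, the remaining positions attend causally, and the loss is computed only on tokens after the prefix. The function names `prefix_lm_mask` and `loss_positions` are hypothetical and are not taken from the authors' code.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Build a seq_len x seq_len boolean attention mask (assumed PrefixLM layout).

    mask[i][j] is True when position i may attend to position j.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    # Every position sees the whole prefix; beyond the prefix, attention is causal.
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def loss_positions(seq_len, prefix_len):
    """Mark which positions contribute to the language modeling loss."""
    # Only tokens after the prefix are predicted, so only they are scored.
    return [i >= prefix_len for i in range(seq_len)]


if __name__ == "__main__":
    # Hypothetical example: a 2-token prefix followed by 2 tokens to predict.
    for row in prefix_lm_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
    print(loss_positions(4, 2))
```

With `prefix_len = 0` the mask reduces to an ordinary causal mask, and with `prefix_len = seq_len` it becomes fully bidirectional, which is why a single objective can cover both encoding and generation.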
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Simple Visual Language Model
