A Survey of Vision-Language Pre-Trained Models
Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao

TL;DR
This survey reviews recent progress in Vision-Language Pre-Trained Models (VL-PTMs), covering encoding methods, architectures, pre-training tasks, and downstream applications, and highlights future research directions in multimodal learning.
Contribution
It provides a comprehensive synthesis of recent VL-PTMs, detailing encoding strategies, architectures, and tasks, serving as a guide for future research in multimodal vision-language models.
Findings
VL-PTMs have advanced significantly in recent years.
Various encoding and interaction architectures are used in VL-PTMs.
Pre-training tasks and downstream applications are diverse and evolving.
Abstract
As the Transformer has evolved, pre-trained models have advanced at a breakneck pace in recent years. They now dominate mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to Vision-and-Language (V-L) learning and improve downstream task performance has become a focus of multimodal learning. In this paper, we review recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts into single-modal embeddings before pre-training. Then, we dive into the mainstream architectures of VL-PTMs for modeling the interaction between text and image representations. We further present widely used pre-training tasks and then introduce some common downstream tasks. We finally conclude this paper and present some promising research…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Adam · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Absolute Position Encodings
