Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Feng Li; Hao Zhang; Yi-Fan Zhang; Shilong Liu; Jian Guo; Lionel M. Ni; PengChuan Zhang; Lei Zhang

arXiv:2203.01922 · cs.CV · March 4, 2022 · 30 cites

Open Access

TL;DR

This survey reviews the evolution of vision-language intelligence, highlighting task-specific methods, pre-training techniques, and large-scale models that leverage raw image-text data for improved multimodal understanding.

Contribution

It provides a comprehensive chronological overview of vision-language research, emphasizing recent advances in large-scale models and data utilization for better generalization.

Findings

01. Progression from task-specific to large-scale models
02. Effective use of raw image-text data for zero-shot learning
03. Emerging trends in modality cooperation and unified representations

Abstract

This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension. We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data. We first take some common VL tasks as examples to introduce the development of task-specific methods. Then we focus on VLP methods and comprehensively review key components of the model structures and training methods. After that, we show how recent work utilizes large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero or…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques