Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, PengChuan Zhang, Lei Zhang

TL;DR
This survey reviews the evolution of vision-language intelligence, highlighting task-specific methods, pre-training techniques, and large-scale models that leverage raw image-text data for improved multimodal understanding.
Contribution
It provides a comprehensive chronological overview of vision-language research, emphasizing recent advances in large-scale models and data utilization for better generalization.
Findings
Progression from task-specific to large-scale models
Effective use of raw image-text data for zero-shot learning
Emerging trends in modality cooperation and unified representations
Abstract
This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension. We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data. We first take some common VL tasks as examples to introduce the development of task-specific methods. Then we focus on VLP methods and comprehensively review key components of the model structures and training methods. After that, we show how recent work utilizes large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero or…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
