Vision-Language Models for Vision Tasks: A Survey
Jingyi Zhang, Jiaxing Huang, Sheng Jin, Shijian Lu

TL;DR
This survey reviews the development, architectures, datasets, methods, and challenges of vision-language models (VLMs) that enable zero-shot visual recognition, highlighting their potential to transform traditional paradigms.
Contribution
It systematically categorizes existing VLM pre-training, transfer learning, and knowledge distillation methods, providing comprehensive analysis and future research directions.
Findings
VLMs enable zero-shot recognition across various tasks.
Pre-training datasets and architectures significantly impact VLM performance.
Benchmarking reveals strengths and limitations of current VLM approaches.
Abstract
Most visual recognition studies rely heavily on crowd-labelled data to train deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single model. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, which summarize the widely-adopted network architectures, pre-training objectives, and downstream…
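The zero-shot prediction mentioned in the abstract can be illustrated with a minimal sketch. It assumes a CLIP-style setup in which an image encoder and a text encoder map their inputs into a shared embedding space, and the predicted class is the one whose text embedding is most similar to the image embedding. The embeddings below are hand-made toy vectors rather than outputs of a real model, and the function name `zero_shot_classify`, the class names, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.01):
    """Return (best class index, class probabilities).

    Each class is represented by the embedding of a text prompt.
    The temperature is an assumed value, not one taken from the survey.
    """
    image = l2_normalize(image_embedding)
    similarities = [
        sum(a * b for a, b in zip(image, l2_normalize(text)))
        for text in text_embeddings
    ]
    # Softmax over temperature-scaled similarities (max subtracted for stability).
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy stand-ins for encoder outputs (assumed values, not real model embeddings).
class_names = ["cat", "dog", "car"]
text_embeddings = [
    [1.0, 0.1, 0.0],  # stands in for the text embedding of a "cat" prompt
    [0.1, 1.0, 0.0],  # stands in for the text embedding of a "dog" prompt
    [0.0, 0.0, 1.0],  # stands in for the text embedding of a "car" prompt
]
image_embedding = [0.2, 0.9, 0.1]  # stands in for the image encoder output

pred, probs = zero_shot_classify(image_embedding, text_embeddings)
print(class_names[pred])
```

No task-specific training happens here: adding a new class only requires adding another text embedding, which is why a single pre-trained model can be reused across recognition tasks.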
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Knowledge Distillation
