Vision-Language Models for Vision Tasks: A Survey

Jingyi Zhang; Jiaxing Huang; Sheng Jin; Shijian Lu

arXiv:2304.00685 · cs.CV · February 19, 2024 · 34 cites

TL;DR

This survey reviews the development, architectures, datasets, methods, and challenges of vision-language models (VLMs) that enable zero-shot visual recognition, highlighting their potential to transform traditional paradigms.

Contribution

It systematically categorizes existing VLM pre-training, transfer learning, and knowledge distillation methods, providing comprehensive analysis and future research directions.

Findings

1. VLMs enable zero-shot recognition across various tasks.
2. Pre-training datasets and architectures significantly impact VLM performance.
3. Benchmarking reveals strengths and limitations of current VLM approaches.

Abstract

Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, which makes the paradigm laborious and time-consuming. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and a single VLM enables zero-shot predictions on various visual recognition tasks. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs that summarize the widely adopted network architectures, pre-training objectives, and downstream…
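The zero-shot prediction idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a VLM has already mapped an image and a set of candidate text prompts into a shared embedding space; the prediction is then the prompt whose embedding is most similar to the image embedding. The vectors below are hand-made toy values, not outputs of any real model, and the names `cosine`, `zero_shot_classify`, and the `temperature` value are hypothetical choices for illustration rather than anything specified by the survey.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Return a probability per text prompt via softmax over scaled similarities.

    temperature is an assumed scaling constant, chosen here only for illustration.
    """
    labels = list(text_embs)
    logits = [cosine(image_emb, text_embs[label]) / temperature for label in labels]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Toy embeddings standing in for the outputs of a text encoder (assumption).
text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Toy embedding standing in for the output of an image encoder (assumption).
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, text_embs)
prediction = max(probs, key=probs.get)
print(prediction)
```

No task-specific training happens in this sketch: changing the recognition task only means changing the list of text prompts, which is the property the abstract refers to as zero-shot prediction with a single VLM.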

Code & Models

Repositories

jingyi0000/vlm_survey
tf · Official

Datasets

BAAI/SurveyScope
dataset · 6 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Knowledge Distillation