Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

Akash Ghosh; Arkadeep Acharya; Sriparna Saha; Vinija Jain; Aman Chadha

arXiv:2404.07214 · cs.CV · October 15, 2025 · 1 citation

Open Access

TL;DR

This survey comprehensively reviews current vision-language models, categorizing them by capabilities, analyzing their architectures and performance, and highlighting future research directions in integrating visual and textual understanding.

Contribution

It provides a detailed classification, analysis, and benchmarking of VLMs, offering insights into their architectures, strengths, limitations, and future research avenues.

Findings

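01. VLMs are categorized into three functional groups: vision-language understanding models, models that take multimodal inputs and generate text, and models that both accept and produce multimodal content.

02. Performance varies across different benchmark datasets.

03. The survey identifies key challenges and future directions in VLM research.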

Abstract

The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs. This classification is…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques