Foundational Models Defining a New Era in Vision: A Survey and Outlook

Muhammad Awais; Muzammal Naseer; Salman Khan; Rao Muhammad Anwer; Hisham Cholakkal; Mubarak Shah; Ming-Hsuan Yang; Fahad Shahbaz Khan

arXiv:2307.13721 · cs.CV · July 27, 2023 · 68 cites

TL;DR

This survey comprehensively reviews foundational models in vision, highlighting their architectures, training methods, applications, challenges, and future research directions in multimodal understanding and reasoning.

Contribution

It provides a systematic overview of foundational vision models, covering design, training, prompting, and evaluation, and discusses open challenges and future research avenues.

Findings

01 Wide adoption of multimodal architectures and training objectives.

02 Emerging applications across vision, language, and audio modalities.

03 Identified challenges in evaluation, bias, and interpretability.

Abstract

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, which is naturally governed by grammatical rules, and in other modalities such as audio and depth. Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating the robot's behavior through…

Code & Models

Repositories

awaisrauf/awesome-cv-foundational-models (Official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications