Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Shahbaz Khan

TL;DR
This survey comprehensively reviews foundational models in vision, highlighting their architectures, training methods, applications, challenges, and future research directions in multimodal understanding and reasoning.
Contribution
It provides a systematic overview of foundational vision models, covering their design, training, prompting, and evaluation, and discusses open challenges and future research avenues.
Findings
Wide adoption of multimodal architectures and training objectives.
Emerging applications across vision, language, and audio modalities.
Identified challenges in evaluation, bias, and interpretability.
Abstract
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
