Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities
Sai Munikoti, Ian Stewart, Sameera Horawalavithana, Henry Kvinge, Tegan Emerson, Sandra E Thompson, Karl Pazdernik

TL;DR
This paper reviews the development of generalist multimodal models, analyzing architectures, challenges, and opportunities to guide future research in creating versatile AI systems across multiple data modalities.
Contribution
It introduces a novel taxonomy based on architecture and training configurations, focusing on Unifiability, Modularity, and Adaptability for GMMs.
Findings
Identifies key architectural factors for GMMs
Highlights challenges in extending models beyond text and vision
Provides a roadmap for future multimodal AI research
Abstract
Multimodal models are expected to be a critical component of future advances in artificial intelligence. This field is starting to grow rapidly with a surge of new design elements motivated by the success of foundation models in natural language processing (NLP) and vision. It is widely hoped that further extending the foundation models to multiple modalities (e.g., text, image, video, sensor, time series, graph, etc.) will ultimately lead to generalist multimodal models, i.e., one model across different data modalities and tasks. However, there is little research that systematically analyzes recent multimodal models (particularly the ones that work beyond text and vision) with respect to the underlying architecture proposed. Therefore, this work provides a fresh perspective on generalist multimodal models (GMMs) via a novel architecture and training configuration specific taxonomy. This…
Taxonomy
Topics: Speech and dialogue systems · AI-based Problem Solving and Planning · Semantic Web and Ontologies
