The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang

TL;DR
This paper explores GPT-4V(ision), a large multimodal model, analyzing its capabilities, supported input modes, and potential applications. It highlights the model's unprecedented ability to process interleaved multimodal inputs and its generalist abilities, and discusses what these mean for future research and real-world tasks.
Contribution
The paper provides a comprehensive qualitative analysis of GPT-4V(ision), revealing its multimodal generality, input flexibility, and potential for new human-computer interaction methods.
Findings
GPT-4V can process arbitrarily interleaved multimodal inputs.
GPT-4V demonstrates strong capabilities across diverse tasks and domains.
Visual marker understanding enables new interaction methods.
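To make "arbitrarily interleaved multimodal inputs" concrete, here is a minimal sketch of a prompt represented as an ordered list of text and image parts. It is only an illustration: the class names, the file paths, and the working definition of "interleaved" (the modality switches at least twice) are assumptions made for this example and do not come from the paper or from any GPT-4V API.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    path: str  # hypothetical local file path, not taken from the paper


Part = Union[TextPart, ImagePart]


def describe_prompt(parts: List[Part]) -> str:
    """Return the order of modalities in the prompt, e.g. 'text,image,text'."""
    return ",".join("text" if isinstance(p, TextPart) else "image" for p in parts)


def is_interleaved(parts: List[Part]) -> bool:
    """Assumed definition: the prompt switches modality at least twice."""
    kinds = [isinstance(p, TextPart) for p in parts]
    switches = sum(1 for a, b in zip(kinds, kinds[1:]) if a != b)
    return switches >= 2


# A prompt that alternates freely between text and images.
prompt = [
    TextPart("Here is a receipt:"),
    ImagePart("receipt_1.png"),
    TextPart("And a second one:"),
    ImagePart("receipt_2.png"),
    TextPart("What is the combined total?"),
]

print(describe_prompt(prompt))
print(is_interleaved(prompt))
```

A single image followed by a question would not count as interleaved under this assumed definition, whereas the receipt prompt above would.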
Abstract
Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system.…
Videos
RT-X and the Dawn of Large Multimodal Models: Google Breakthrough and 160-page Report Highlights · youtube
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
