Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery
Zhenyuan Yang, Xuhui Lin, Qinyi He, Ziye Huang, Zhengliang Liu, Hanqi Jiang, Peng Shu, Zihao Wu, Yiwei Li, Stephen Law, Gengchen Mai, Tianming Liu, and Tao Yang

TL;DR
This paper evaluates ChatGPT-4V and Gemini Pro on multimodal tasks in three domains: Street View Imagery, Built Environment, and Interior. The models show strengths in length measurement and style analysis but limitations in detailed recognition and counting.
Contribution
It provides a comprehensive assessment of multimodal foundation models' capabilities and limitations across diverse real-world visual and textual tasks.
Findings
Proficient in length measurement and style analysis
Limitations in detailed recognition and counting tasks
Zero-shot learning shows potential but varies by domain
Abstract
The emergence of Large Language Models (LLMs) and multimodal foundation models (FMs) has generated heightened interest in their applications that integrate vision and language. This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior by evaluating their performance across various tasks. The assessments include street furniture identification, pedestrian and car counts, and road width measurement in Street View Imagery; building function classification, building age analysis, building height analysis, and building structure classification in the Built Environment; and interior room classification, interior design style analysis, interior furniture counts, and interior length measurement in Interior. The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding,…
Taxonomy
Topics: Automated Road and Building Extraction · Geographic Information Systems Studies
