Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery

Zhenyuan Yang; Xuhui Lin; Qinyi He; Ziye Huang; Zhengliang Liu; Hanqi Jiang; Peng Shu; Zihao Wu; Yiwei Li; Stephen Law; Gengchen Mai; Tianming Liu; and Tao Yang

arXiv:2408.12821·cs.CV·August 26, 2024

Open Access

TL;DR

This paper evaluates ChatGPT-4V and Gemini Pro on multimodal tasks in three domains: Street View Imagery, Built Environment, and Interior. Both models perform well at length measurement and style analysis but struggle with detailed recognition and counting.

Contribution

The paper assesses the capabilities and limitations of multimodal foundation models across a range of real-world visual and textual tasks.

Findings

01 Proficient in length measurement and style analysis

02 Limitations in detailed recognition and counting tasks

03 Zero-shot learning shows potential but varies by domain

Abstract

The emergence of Large Language Models (LLMs) and multimodal foundation models (FMs) has generated heightened interest in their applications that integrate vision and language. This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior by evaluating their performance across various tasks. The assessments include street furniture identification, pedestrian and car counts, and road width measurement in Street View Imagery; building function classification, building age analysis, building height analysis, and building structure classification in the Built Environment; and interior room classification, interior design style analysis, interior furniture counts, and interior length measurement in Interior. The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding,…
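
The task layout listed in the abstract can be sketched as a simple mapping from domain to tasks. This is only an illustrative sketch: the names TASKS and iter_tasks are hypothetical and are not taken from the paper or its code; the domains and task names themselves come from the abstract.

```python
# Hypothetical layout of the evaluation tasks, as listed in the abstract.
# The identifiers TASKS and iter_tasks are illustrative, not the authors' code.
TASKS = {
    "Street View Imagery": [
        "street furniture identification",
        "pedestrian and car counts",
        "road width measurement",
    ],
    "Built Environment": [
        "building function classification",
        "building age analysis",
        "building height analysis",
        "building structure classification",
    ],
    "Interior": [
        "interior room classification",
        "interior design style analysis",
        "interior furniture counts",
        "interior length measurement",
    ],
}


def iter_tasks(tasks=TASKS):
    """Yield (domain, task) pairs in the order the abstract lists them."""
    for domain, names in tasks.items():
        for name in names:
            yield domain, name


if __name__ == "__main__":
    for domain, task in iter_tasks():
        print(f"{domain}: {task}")
```

Keeping the tasks in one mapping makes it straightforward to loop over every (domain, task) pair when querying each model, which is one plausible way to organize such an evaluation.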

Taxonomy

Topics: Automated Road and Building Extraction · Geographic Information Systems Studies