City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning

Penglei Sun; Yaoxian Song; Xiangru Zhu; Xiang Liu; Qiang Wang; Yue Liu; Changqun Xia; Tiefeng Li; Yang Yang; Xiaowen Chu

arXiv:2507.12795·cs.CV·July 18, 2025

Open Access

TL;DR

City-VLM is a multimodal outdoor scene understanding model that fuses incomplete 2D and 3D data across diverse large-scale environments, improving question-answering accuracy over existing LVLMs by 18.14%.

Contribution

The paper presents SVM-City, the first multidomain outdoor scene dataset, and City-VLM, a multimodal learning framework that handles incomplete data without explicit fusion operations.

Findings

01. City-VLM outperforms existing LVLMs by 18.14% in question-answering tasks.

02. Introduces a large-scale outdoor scene dataset with 420k images and 4.8 billion point clouds.

03. Demonstrates strong generalization across various outdoor environments.

Abstract

Scene understanding enables intelligent agents to interpret and comprehend their environment. While existing large vision-language models (LVLMs) for scene understanding have primarily focused on indoor household tasks, they face two significant limitations when applied to outdoor large-scale scene understanding. First, outdoor scenarios typically encompass larger-scale environments observed through various sensors from multiple viewpoints (e.g., bird view and terrestrial view), while existing indoor LVLMs mainly analyze single visual modalities within building-scale contexts from humanoid viewpoints. Second, existing LVLMs suffer from missing multidomain perception outdoor data and struggle to effectively integrate 2D and 3D visual information. To address the aforementioned limitations, we build the first multidomain perception outdoor scene understanding dataset, named…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques