CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data
Taiki Miyanishi, Fumiya Kitamori, Shuhei Kurita, Jungdae Lee, Motoaki Kawanabe, Nakamasa Inoue

TL;DR
CityRefer is a manually verified dataset of natural language descriptions and landmark labels for 3D city scenes. It targets visual grounding in outdoor urban environments, with applications such as user-interactive navigation of autonomous vehicles and drones.
Contribution
The paper introduces the CityRefer dataset, the largest city-scale 3D visual grounding dataset with manual annotations, and a baseline system for 3D object localization using language and geographic data.
Findings
CityRefer contains 35k natural language descriptions of 3D objects in SensatUrban city scenes and 5k landmark labels synchronized with OpenStreetMap.
The dataset is manually verified for quality and accuracy.
A baseline system demonstrates effective visual grounding on city-scale 3D data.
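To make the grounding task concrete, the following is a minimal sketch of how a description could be matched to one of several candidate objects in a scene. It is not the paper's baseline system: the `CityObject` and `Description` types, the `score` and `ground` functions, the scene ID, and the landmark names are all hypothetical, and simple word overlap stands in for a learned model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CityObject:
    """Hypothetical record for one segmented object in a city scene."""
    object_id: int
    category: str                   # e.g. "car", "building"
    landmark: Optional[str] = None  # assumed: name of a nearby mapped landmark


@dataclass(frozen=True)
class Description:
    """Hypothetical record pairing a text description with its target object."""
    scene_id: str
    target_id: int
    text: str


def score(text: str, obj: CityObject) -> int:
    """Count category and landmark words that also occur in the description."""
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    cues = obj.category.lower().split()
    if obj.landmark:
        cues += obj.landmark.lower().split()
    return sum(1 for cue in cues if cue in words)


def ground(desc: Description, candidates: Sequence[CityObject]) -> int:
    """Return the ID of the candidate with the highest word-overlap score."""
    return max(candidates, key=lambda obj: score(desc.text, obj)).object_id


# Toy scene: two cars that only the landmark cue can tell apart.
candidates = [
    CityObject(0, "car", "Central Library"),
    CityObject(1, "car", "King Street"),
    CityObject(2, "building", "Central Library"),
]
desc = Description("scene_a", 1, "The white car parked on King Street.")
predicted = ground(desc, candidates)
```

In this toy scene the two cars share a category, so only the landmark name separates them, which illustrates why geographic labels can help disambiguate objects at city scale.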
Abstract
City-scale 3D point clouds are a promising way to express detailed and complicated outdoor structures. They encompass both the appearance and geometry features of segmented city components, including cars, streets, and buildings, that can be utilized for attractive applications such as user-interactive navigation of autonomous vehicles and drones. However, compared to the extensive text annotations available for images and indoor scenes, the scarcity of text annotations for outdoor scenes poses a significant challenge for achieving these applications. To tackle this problem, we introduce the CityRefer dataset for city-level visual grounding. The dataset consists of 35k natural language descriptions of 3D objects appearing in SensatUrban city scenes and 5k landmark labels synchronized with OpenStreetMap. To ensure the quality and accuracy of the dataset, all descriptions and labels in…
Taxonomy
Topics: Multimodal Machine Learning Applications · 3D Surveying and Cultural Heritage · Advanced Image and Video Retrieval Techniques
