3D Question Answering for City Scene Understanding
Penglei Sun, Yaoxian Song, Xiang Liu, Xiaofei Yang, Qiang Wang, Tiefeng Li, Yang Yang, and Xiaowen Chu

TL;DR
This paper introduces a new 3D question answering dataset for city scene understanding and proposes a scene graph-based method that achieves state-of-the-art accuracy in city-level 3D scene comprehension.
Contribution
The paper presents City-3DQA, the first city-level 3D multimodal question answering (MQA) dataset to cover both scene semantic and human-environment interaction tasks, together with Sg-CityU, a novel scene graph-based method for improved city scene understanding.
Findings
Sg-CityU achieves over 63% accuracy on City-3DQA.
The dataset includes semantic and human-environment interaction tasks.
Sg-CityU outperforms indoor 3D MQA methods and zero-shot LLM approaches.
Abstract
3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantic and human-environment interactive tasks within the city. From the method…
