VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation
Haochen Zhang, Nader Zantout, Pujith Kachana, Zongyuan Wu, Ji Zhang, Wenshan Wang

TL;DR
VLA-3D is the largest real-world 3D indoor scene dataset with semantic relations and language annotations, designed to advance navigation and understanding in embodied AI systems.
Contribution
The paper introduces VLA-3D, a comprehensive 3D indoor dataset with semantic, spatial, and language annotations, enabling improved multimodal navigation research.
Findings
Benchmark results establish baseline performance for current models.
Dataset covers diverse real-world indoor scenes with detailed annotations.
Code and data are publicly available for research use.
Abstract
With the recent rise of Large Language Models (LLMs), Vision-Language Models (VLMs), and other general foundation models, there is growing potential for multimodal, multi-task embodied agents that can operate in diverse environments given only natural language as input. One such application area is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the spatial reasoning and semantic understanding required, particularly in arbitrary scenes that may contain many objects belonging to fine-grained classes. To address this challenge, we curate the largest real-world dataset for Vision and Language-guided Action in 3D Scenes (VLA-3D), consisting of over 11.5K scanned 3D indoor rooms from existing datasets, 23.5M heuristically generated semantic relations between objects, and 9.7M synthetically generated referential…
Taxonomy
Topics: Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage · Advanced Image and Video Retrieval Techniques
