Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models

Xiaoyan Wang; Zeju Li; Yifan Xu; Jiaxing Qi; Zhifei Yang; Ruifei Ma; Xiangde Liu; Chao Zhang

arXiv:2507.16524 · cs.CV · July 23, 2025

Open Access

TL;DR

Spatial 3D-LLM enhances spatial awareness in 3D vision-language models by enriching spatial embeddings through a progressive scheme, enabling better performance on new 3D tasks and datasets.

Contribution

The paper introduces a 3D multimodal LLM built around a progressive spatial awareness scheme, together with new 3D tasks and a new dataset for evaluating spatial understanding.

Findings

01  Achieves state-of-the-art results on 3D vision-language tasks.

02  Demonstrates improved spatial understanding through a progressive embedding scheme.

03  Introduces new tasks and a new dataset for evaluating 3D spatial awareness.

Abstract

A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness due to insufficient representation of the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that captures spatial information step by step as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. Furthermore, we introduce two novel tasks: 3D object distance…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques