SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models
Shun Taguchi, Hideki Deguchi, Takumi Hamazaki, Hiroyuki Sakai

TL;DR
SpatialPrompting leverages off-the-shelf multimodal large language models with keyframe-driven prompts to perform zero-shot spatial reasoning in 3D environments, avoiding expensive fine-tuning and specialized inputs.
Contribution
The paper introduces a keyframe-driven prompt generation strategy that enables zero-shot 3D spatial reasoning with general-purpose multimodal models, and reports state-of-the-art zero-shot results on ScanQA and SQA3D.
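To make the idea of a keyframe-driven prompt concrete, here is a minimal Python sketch of how selected keyframes and their camera poses might be turned into the text part of a prompt. The `Keyframe` fields, the `build_prompt` helper, and the prompt wording are all hypothetical; this summary does not give the paper's actual template.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Keyframe:
    # All fields are illustrative assumptions, not the paper's data format.
    image_id: str
    position: Tuple[float, float, float]  # camera position, assumed to be in metres
    yaw_deg: float                        # camera heading, assumed to be in degrees


def build_prompt(keyframes: List[Keyframe], question: str) -> str:
    """Build the text portion of a prompt: one pose line per keyframe, then the question."""
    lines = ["Each image below was taken inside the same 3D scene at the listed camera pose."]
    for n, kf in enumerate(keyframes, start=1):
        x, y, z = kf.position
        lines.append(
            f"Image {n} ({kf.image_id}): position=({x:.2f}, {y:.2f}, {z:.2f}) m, "
            f"yaw={kf.yaw_deg:.0f} deg"
        )
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    frames = [
        Keyframe("frame_0012", (0.0, 1.5, 0.2), 90.0),
        Keyframe("frame_0087", (2.3, 1.5, -1.0), 180.0),
    ]
    print(build_prompt(frames, "What is to the left of the sofa?"))
```

In practice the images themselves would be attached alongside this text through whatever interface the chosen multimodal model exposes; that step is left out because it depends on the model.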
Findings
Achieves state-of-the-art zero-shot performance on ScanQA and SQA3D datasets.
Eliminates the need for specialized 3D inputs and fine-tuning.
Provides a flexible, scalable approach to 3D spatial reasoning.
Abstract
This study introduces SpatialPrompting, a novel framework that harnesses the emergent reasoning capabilities of off-the-shelf multimodal large language models to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Unlike existing methods that rely on expensive 3D-specific fine-tuning with specialized 3D inputs such as point clouds or voxel-based features, SpatialPrompting employs a keyframe-driven prompt generation strategy. This framework uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences and then integrates them with corresponding camera pose data to effectively abstract spatial relationships and infer complex 3D structures. The proposed framework not only establishes a new paradigm for flexible spatial reasoning that utilizes…
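The keyframe selection described in the abstract can be illustrated with a simplified sketch. The Python code below covers only two of the named metrics, image sharpness (here the variance of a discrete Laplacian) and Mahalanobis distance between camera positions, and combines them with a greedy rule. The function names, the `min_dist` threshold, and the greedy rule itself are assumptions made for illustration, not the authors' algorithm; vision-language similarity and field of view are omitted.

```python
import numpy as np


def laplacian_sharpness(gray):
    """Variance of a 4-neighbour discrete Laplacian; higher values suggest a sharper image."""
    g = np.asarray(gray, dtype=np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between x and y given an inverse covariance matrix."""
    d = np.asarray(x, dtype=np.float64) - np.asarray(y, dtype=np.float64)
    return float(np.sqrt(d @ cov_inv @ d))


def select_keyframes(positions, sharpness, k, min_dist=1.0):
    """Greedy sketch (assumed rule): visit frames from sharpest to blurriest and keep
    a frame only if it is at least `min_dist` away, in Mahalanobis distance,
    from every frame already kept. Stops once `k` frames are kept."""
    positions = np.asarray(positions, dtype=np.float64)
    # Covariance of all camera positions, lightly regularised so it is invertible.
    cov = np.cov(positions, rowvar=False) + 1e-6 * np.eye(positions.shape[1])
    cov_inv = np.linalg.inv(cov)
    order = np.argsort(-np.asarray(sharpness, dtype=np.float64))
    kept = []
    for i in order:
        if all(mahalanobis(positions[i], positions[j], cov_inv) >= min_dist for j in kept):
            kept.append(int(i))
        if len(kept) == k:
            break
    return sorted(kept)


if __name__ == "__main__":
    # Frames 0 and 1 share a camera position; only the sharper one should survive.
    positions = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
    sharpness = [1.0, 5.0, 3.0, 2.0]
    print(select_keyframes(positions, sharpness, k=4))
```

Using Mahalanobis rather than Euclidean distance scales each direction by how much the camera actually moved along it, so the threshold adapts to the extent of the scene.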
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Advanced Neural Network Applications
Methods: Sparse Evolutionary Training
