TL;DR
SpaceDrive introduces a spatial-aware vision language model framework for autonomous driving, explicitly encoding 3D spatial information to improve reasoning and planning accuracy in complex environments.
Contribution
It proposes a novel method of using explicit 3D positional encodings in VLMs, enhancing spatial reasoning and trajectory prediction in autonomous driving tasks.
Findings
Achieves state-of-the-art open-loop performance on the nuScenes dataset.
Attains the second-best Driving Score, 78.02, on the Bench2Drive benchmark.
Demonstrates improved spatial reasoning and planning accuracy.
Abstract
End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development, driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D…
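To make the idea of explicit 3D positional encodings concrete, here is a minimal sketch in NumPy. It is not the SpaceDrive implementation: the abstract does not specify the encoder's form, so the sinusoidal encoding, the pinhole back-projection, and the names `backproject` and `sinusoidal_pe_3d` are all assumptions chosen for illustration. The sketch lifts a depth map to 3D points, encodes each point, and adds the result to per-patch features.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to (H*W, 3) camera-frame points.

    Assumes a simple pinhole camera model (an illustrative choice).
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


def sinusoidal_pe_3d(coords, dim, max_freq=10000.0):
    """Encode (N, 3) coordinates as (N, dim) sinusoidal features.

    Hypothetical encoder: each axis gets dim // 3 channels,
    half sine and half cosine, so dim must be divisible by 6.
    """
    coords = np.asarray(coords, dtype=np.float64)
    if coords.ndim != 2 or coords.shape[1] != 3:
        raise ValueError("coords must have shape (N, 3)")
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6")
    half = dim // 6
    freqs = 1.0 / (max_freq ** (np.arange(half) / half))   # (half,)
    angles = coords[:, :, None] * freqs[None, None, :]     # (N, 3, half)
    pe = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return pe.reshape(coords.shape[0], dim)


# Toy example: a 2x3 grid of patches, all 5 m away.
depth = np.full((2, 3), 5.0)
points = backproject(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)

# Stand-in for per-patch visual features (zeros for clarity).
tokens = np.zeros((6, 12))

# "Superimpose" the 3D PE on each token by element-wise addition.
augmented = tokens + sinusoidal_pe_3d(points, dim=12)
```

The same encoder could in principle be applied to coordinates from other sources, such as past ego positions, which is the sense in which the abstract calls the encoder universal; how SpaceDrive actually does this is not shown in the excerpt above.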
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Human Motion and Animation
