Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study
Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, and Hao Frank Yang

TL;DR
This paper introduces Spatial Intelligence Grid (SIG), a structured scene representation that improves the evaluation and learning of visual-spatial intelligence in foundation models, especially for autonomous driving applications.
Contribution
The paper proposes SIG as a novel grid-based schema for explicit spatial encoding, along with SIG-informed metrics and a new benchmark, SIGBench, to enhance spatial reasoning in foundation models.
Findings
SIG improves VSI metric performance across models.
SIG yields more stable and comprehensive spatial reasoning.
SIGBench provides extensive annotated data for autonomous driving scenarios.
Abstract
How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model's intrinsic VSI, separating spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g., GPT- and Gemini-family models),…
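To make the idea of a grid-based scene encoding concrete, here is a minimal Python sketch of how object layouts and inter-object relations could be read off a top-down grid. It is an illustration under stated assumptions, not the paper's SIG schema: the names `GridObject`, `SceneGrid`, `relation`, and `render`, the row/column convention, and the relation vocabulary are all hypothetical, and physically grounded priors are not modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GridObject:
    """A labelled object in one grid cell (hypothetical schema, not the paper's)."""
    name: str
    row: int  # assumption: row 0 is farthest ahead of the ego vehicle
    col: int  # assumption: column 0 is the leftmost lane/cell


@dataclass
class SceneGrid:
    """A top-down occupancy grid holding named objects."""
    rows: int
    cols: int
    objects: list = field(default_factory=list)

    def add(self, name: str, row: int, col: int) -> None:
        """Place an object, rejecting cells outside the grid."""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise ValueError(f"cell ({row}, {col}) is outside the grid")
        self.objects.append(GridObject(name, row, col))

    def _find(self, name: str) -> GridObject:
        for obj in self.objects:
            if obj.name == name:
                return obj
        raise KeyError(name)

    def relation(self, a: str, b: str) -> str:
        """Describe where object `a` sits relative to object `b`."""
        oa, ob = self._find(a), self._find(b)
        parts = []
        if oa.row < ob.row:
            parts.append("ahead of")
        elif oa.row > ob.row:
            parts.append("behind")
        if oa.col < ob.col:
            parts.append("left of")
        elif oa.col > ob.col:
            parts.append("right of")
        return " and ".join(parts) if parts else "same cell as"

    def render(self) -> str:
        """Draw the grid as text, marking each object by its first letter."""
        cells = [["." for _ in range(self.cols)] for _ in range(self.rows)]
        for obj in self.objects:
            cells[obj.row][obj.col] = obj.name[0].upper()
        return "\n".join(" ".join(row) for row in cells)


if __name__ == "__main__":
    grid = SceneGrid(rows=4, cols=3)
    grid.add("ego", 3, 1)
    grid.add("truck", 1, 1)
    grid.add("pedestrian", 2, 2)
    print(grid.render())
    print("pedestrian is", grid.relation("pedestrian", "ego"), "ego")
```

Because the relations are computed from cell coordinates rather than from free text, an answer can be checked against geometry directly, which is the kind of separation from language priors the abstract describes.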
