Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Haicheng Liao; Huanming Shen; Bonan Wang; Yongkang Li; Yihong Tang; Chengyue Wang; Dingyi Zhuang; Kehua Chen; Hai Yang; Chengzhong Xu; Zhenning Li

arXiv:2512.03454 · cs.CV · May 11, 2026

TL;DR

ThinkDeeper introduces a world model-inspired multimodal grounding framework for autonomous vehicles, reasoning about future spatial states to improve natural-language command interpretation and object localization.

Contribution

The paper proposes a Spatial-Aware World Model (SA-WM) and a hypergraph-guided hierarchical decoder for visual grounding in autonomous driving, and reports gains in robustness and efficiency.

Findings

1. Ranks #1 on the Talk2Car leaderboard.

2. Outperforms state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO benchmarks.

3. Maintains high performance with limited training data.

Abstract

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust…
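
The "distill, then roll out" idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' SA-WM: the names `distill_state`, `step`, and `rollout` are hypothetical, the command-aware latent state is approximated by attention pooling over region features with the command embedding as the query, and the transition is a plain `tanh` recurrence with random weights.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distill_state(region_feats, command_emb):
    """Hypothetical command-aware latent: attention-pool region features,
    using the command embedding as the query."""
    weights = softmax([dot(r, command_emb) for r in region_feats])
    dim = len(command_emb)
    return [sum(w * r[i] for w, r in zip(weights, region_feats)) for i in range(dim)]


def step(state, command_emb, W):
    """Hypothetical transition: z_next = tanh(W @ z + command_emb)."""
    return [math.tanh(dot(row, state) + c) for row, c in zip(W, command_emb)]


def rollout(region_feats, command_emb, W, horizon=3):
    """Return the current latent state followed by `horizon` future states."""
    z = distill_state(region_feats, command_emb)
    states = [z]
    for _ in range(horizon):
        z = step(z, command_emb, W)
        states.append(z)
    return states


# Toy inputs (random stand-ins for encoder outputs, not real features).
rng = random.Random(0)
dim = 4
regions = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
command = [rng.uniform(-1, 1) for _ in range(dim)]
W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

states = rollout(regions, command, W, horizon=3)
print(len(states), len(states[0]))
```

In this sketch the list `states` plays the role of the forward-looking cues; a decoder would consume it together with the multimodal input. The actual model's architecture, training objective, and decoder are described in the paper, not here.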
