Embodied Understanding of Driving Scenarios
Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li

TL;DR
This paper introduces the Embodied Language Model (ELM), a novel framework that enhances autonomous driving scene understanding by incorporating spatial and temporal awareness, surpassing previous Vision-Language Models in performance.
Contribution
The paper presents ELM, a comprehensive embodied model with space-aware pre-training and time-aware token selection for improved driving scene understanding.
Findings
ELM outperforms previous models on the reformulated benchmark.
ELM demonstrates robust spatial localization capabilities.
ELM effectively captures long-horizon temporal cues.
Abstract
Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios. Such understanding is typically founded upon Vision-Language Models (VLMs). Nevertheless, existing VLMs are restricted to the 2D domain, devoid of spatial awareness and long-horizon extrapolation proficiencies. We revisit the key aspects of autonomous driving and formulate appropriate rubrics. Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans. ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities. Besides, the model employs time-aware token selection to accurately inquire about temporal cues. We instantiate ELM on the reformulated multi-faced benchmark, and it surpasses…
Taxonomy
Topics: Data Visualization and Analytics
