Probing Multimodal LLMs as World Models for Driving

Shiva Sreeram; Tsun-Hsuan Wang; Alaa Maalouf; Guy Rosman; Sertac Karaman; Daniela Rus

arXiv:2405.05956 · cs.RO · October 29, 2024

TL;DR

This paper critically evaluates Multimodal Large Language Models' ability to serve as world models for autonomous driving, revealing strengths in image interpretation but significant challenges in scene understanding and dynamic reasoning.

Contribution

It introduces Eval-LLM-Drive and DriveSim for comprehensive assessment of MLLMs in driving scenarios, exposing current limitations and guiding future improvements.

Findings

01 MLLMs interpret individual images well

02 Struggle to synthesize coherent scene narratives

03 Significant inaccuracies in dynamic scene understanding

Abstract

We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for…

Code & Models

Repositories

sreeramsa/drivesim (Official)

Taxonomy

Topics: Semantic Web and Ontologies