LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

Hao Shao, Letian Wang, Yang Zhou, Yuxuan Hu, Zhuofan Zong, Steven L. Waslander, Wei Zhan, and Hongsheng Li

arXiv:2604.08719 · cs.CV · April 13, 2026

TL;DR

LMGenDrive is a novel framework that unifies multimodal understanding and generative scene modeling to enhance autonomous driving in complex, open-world scenarios.

Contribution

The paper introduces what the authors describe as the first end-to-end model that combines LLM-based multimodal understanding with generative world models for closed-loop driving, supported by a progressive training strategy.

Findings

01. Outperforms prior methods on challenging benchmarks.

02. Improves instruction following and scene understanding.

03. Enhances robustness to rare and safety-critical scenarios.

Abstract

Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive…
