Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning
Rui Zhao, Qirui Yuan, Jinyu Li, Haofeng Hu, Yun Li, Chengyuan Zheng, Fei Gao

TL;DR
Sce2DriveX is a multimodal large language model framework that enhances autonomous driving by integrating scene understanding, reasoning, and control, achieving state-of-the-art results and robust generalization across diverse driving scenarios.
Contribution
It introduces Sce2DriveX, a novel multimodal LLM framework with a chain-of-thought reasoning process for end-to-end autonomous driving, incorporating a new VQA dataset for 3D spatial understanding.
Findings
Achieves state-of-the-art performance in scene understanding and driving tasks.
Demonstrates robust generalization across different driving scenes.
Develops the first extensive VQA dataset for 3D spatial reasoning in driving.
Abstract
End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is an important part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) to high-level semantic understanding of traffic scenes, it remains challenging to translate this conceptual semantic understanding into low-level motion control commands and to achieve generalization and consensus in cross-scene driving. We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning MLLM framework. Sce2DriveX utilizes multimodal joint learning from local scene videos and global BEV maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its comprehensive perception and reasoning capabilities in 3D dynamic/static scenes and achieving driving generalization across scenes. Building on this, it reconstructs…
Taxonomy
Methods: Entropy Regularization · Proximal Policy Optimization · CARLA: An Open Urban Driving Simulator
