Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning

Rui Zhao; Qirui Yuan; Jinyu Li; Haofeng Hu; Yun Li; Chengyuan Zheng; Fei Gao

arXiv:2502.14917·cs.CV·February 24, 2025

TL;DR

Sce2DriveX is a multimodal large language model framework that enhances autonomous driving by integrating scene understanding, reasoning, and control, achieving state-of-the-art results and robust generalization across diverse driving scenarios.

Contribution

The paper introduces Sce2DriveX, a multimodal LLM framework that uses chain-of-thought reasoning for end-to-end autonomous driving, together with a new VQA dataset for 3D spatial understanding.

Findings

01. Achieves state-of-the-art performance in scene understanding and driving tasks.

02. Demonstrates robust generalization across different driving scenes.

03. Develops the first extensive VQA dataset for 3D spatial reasoning in driving.

Abstract

End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is an important part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) for high-level traffic scene semantic understanding, it remains challenging to effectively translate this conceptual semantic understanding into low-level motion control commands and achieve generalization and consensus in cross-scene driving. We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning MLLM framework. Sce2DriveX utilizes multimodal joint learning from local scene videos and global BEV maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its comprehensive perception and reasoning capabilities in 3D dynamic/static scenes and achieving driving generalization across scenes. Building on this, it reconstructs…

Taxonomy

Methods: Entropy Regularization · Proximal Policy Optimization · CARLA: An Open Urban Driving Simulator