MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion
Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen, Gang Wang, Runwei Guan, Weiping Ding

TL;DR
MMDrive is a multimodal vision-language framework that extends scene understanding in autonomous driving from 2D images to 3D by integrating occupancy maps, LiDAR point clouds, and textual scene descriptions, with the aim of improving reasoning in complex driving scenes.
Contribution
MMDrive introduces two components, one for adaptive cross-modal fusion and one for key information extraction, which enable 3D scene understanding beyond traditional image-based models.
Findings
Achieves state-of-the-art performance on DriveLM and NuScenes-QA benchmarks.
Demonstrates significant improvements in autonomous driving scene understanding.
Enables robust multimodal reasoning in complex environments.
Abstract
Vision-language models enable understanding of and reasoning about complex traffic scenarios through multi-source information fusion, establishing them as a core technology for autonomous driving. However, existing vision-language models are constrained by a 2D image understanding paradigm, which restricts their ability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to generalized 3D scene understanding. MMDrive incorporates three complementary modalities: occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction.…
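The abstract excerpt does not say how the adaptive cross-modal fusion component works. As a rough illustration of the general idea only, the sketch below weights three modality feature vectors by their similarity to a query vector and sums them. It is a generic softmax-weighted fusion, not MMDrive's actual architecture. The function names (`softmax`, `dot`, `fuse`), the toy feature vectors, and the scaled dot-product scoring are all assumptions introduced for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def fuse(query, modality_features):
    """Hypothetical fusion: weight each modality by its scaled
    dot-product similarity to the query, then take the weighted sum."""
    names = list(modality_features)
    scale = math.sqrt(len(query))
    scores = [dot(query, modality_features[n]) / scale for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * modality_features[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused

# Toy 4-dimensional features, one per modality named in the abstract.
features = {
    "occupancy": [1.0, 0.0, 0.0, 0.0],
    "lidar": [0.0, 1.0, 0.0, 0.0],
    "text": [0.0, 0.0, 1.0, 0.0],
}
# A query that aligns with the occupancy feature direction.
query = [2.0, 0.0, 0.0, 0.0]

weights, fused = fuse(query, features)
print(weights)
print(fused)
```

In this toy setup the query aligns with the occupancy features, so that modality receives the largest weight. A learned model would produce the query and features from encoders rather than fixed vectors.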
Taxonomy
Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications
