Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models
Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, Xiaomeng Li

TL;DR
This paper introduces NuInstruct, a large multi-view video dataset for autonomous driving tasks, and proposes BEV-InMLLM, a model that integrates bird's-eye-view features with language models to improve driving understanding.
Contribution
The paper presents a new dataset, NuInstruct, with 91K multi-view video-QA pairs across 17 subtasks, and a novel model, BEV-InMLLM, that combines multi-view, spatial, and temporal information for autonomous driving tasks.
Findings
BEV-InMLLM outperforms existing MLLMs by around 9% on NuInstruct tasks.
NuInstruct is challenging for existing models because each task requires multi-view, temporal, and spatial information.
The proposed BEV injection module is effective and versatile for existing MLLMs.
Abstract
The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically focuses on limited tasks and often omits key multi-view and temporal information which is crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), significantly elevating the challenge level. To obtain NuInstruct, we propose a novel SQL-based method to generate instruction-response pairs automatically, which is inspired by the driving logical progression of humans. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features, language-aligned for large language models. BEV-InMLLM integrates…
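The SQL-based generation of instruction-response pairs mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `objects` table, its columns, the sample rows, and the `closest_object_qa` helper are all hypothetical, chosen only to show how a query over annotated driving data can be turned into a question-answer pair.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects ("
        "frame_id INTEGER, view TEXT, category TEXT, distance_m REAL)"
    )
    # Made-up rows for illustration only; not taken from NuInstruct.
    rows = [
        (0, "front", "car", 12.5),
        (0, "front_left", "pedestrian", 7.0),
        (0, "back", "truck", 30.0),
        (1, "front", "car", 10.0),
    ]
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", rows)
    return conn


def closest_object_qa(conn, frame_id):
    """Build one QA pair by querying the nearest object across all views."""
    row = conn.execute(
        "SELECT category, view, distance_m FROM objects "
        "WHERE frame_id = ? ORDER BY distance_m ASC LIMIT 1",
        (frame_id,),
    ).fetchone()
    if row is None:
        return None  # no annotated objects for this frame
    category, view, distance = row
    question = "Which object is closest to the ego vehicle in this frame?"
    answer = f"A {category} in the {view} view, about {distance:.1f} m away."
    return {"frame_id": frame_id, "question": question, "answer": answer}


conn = build_demo_db()
qa = closest_object_qa(conn, 0)
print(qa["question"])
print(qa["answer"])
```

Each question template pairs with one query, so new task types could be added as new (template, query) pairs without manual annotation, assuming the underlying annotations are already stored in tables.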
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
