Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

Xinpeng Ding; Jinahua Han; Hang Xu; Xiaodan Liang; Wei Zhang; Xiaomeng Li

arXiv:2401.00988·cs.CV·January 3, 2024·5 cites

TL;DR

This paper introduces NuInstruct, a large multi-view video dataset for autonomous driving tasks, and proposes BEV-InMLLM, a model that integrates bird's-eye-view features with language models to improve driving understanding.

Contribution

The paper presents a new dataset NuInstruct with 91K multi-view video-QA pairs and a novel BEV-InMLLM model that effectively combines multi-view, spatial, and temporal information for autonomous driving tasks.

Findings

01. BEV-InMLLM outperforms existing MLLMs by around 9% on NuInstruct tasks.

02. NuInstruct significantly challenges models with its multi-view and temporal complexity.

03. The proposed BEV injection module is effective and versatile for existing MLLMs.

Abstract

The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically focuses on limited tasks and often omits key multi-view and temporal information which is crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), significantly elevating the challenge level. To obtain NuInstruct, we propose a novel SQL-based method to generate instruction-response pairs automatically, which is inspired by the driving logical progression of humans. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features, language-aligned for large language models. BEV-InMLLM integrates…

Code & Models

Repositories

xmed-lab/nuinstruct (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling