Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving

Zongchuang Zhao; Haoyu Fu; Dingkang Liang; Xin Zhou; Dingyuan Zhang; Hongwei Xie; Bing Wang; Xiang Bai

arXiv:2505.08725·cs.CV·May 14, 2025

TL;DR

This paper introduces NuInteract, a large-scale multi-view dataset, and DriveMonkey, a framework integrating LVLMs with 3D perception, to enhance comprehensive scene understanding and interactive tasks in autonomous driving.

Contribution

The paper contributes NuInteract, a multi-view dataset of over 1.5M image-language pairs, and DriveMonkey, a framework that couples an LVLM with a spatial processor through learnable queries for autonomous driving tasks.

Findings

01. DriveMonkey outperforms general LVLMs in 3D visual grounding.

02. The dataset enables diverse interactive scene understanding.

03. Integration of 3D detectors improves perception accuracy.

Abstract

Large Vision-Language Models (LVLMs) have significantly advanced image understanding, and their comprehension and reasoning capabilities enable promising applications in autonomous driving. However, existing research typically focuses on front-view perspectives and a subset of the objects in a scene, and so struggles to achieve comprehensive scene understanding. Existing LVLMs also lack a mapping between 2D and 3D and integrate 3D object localization with instruction understanding only weakly. To tackle these limitations, we first introduce NuInteract, a large-scale dataset with over 1.5M multi-view image-language pairs spanning dense scene captions and diverse interactive tasks. Furthermore, we propose DriveMonkey, a simple yet effective framework that seamlessly integrates LVLMs with a spatial processor using a series of learnable queries. The…

Code & Models

Repositories

zc-zhao/drivemonkey (Official)
