TL;DR
This paper introduces NuInteract, a large-scale multi-view dataset, and DriveMonkey, a framework integrating LVLMs with 3D perception, to enhance comprehensive scene understanding and interactive tasks in autonomous driving.
Contribution
The paper contributes NuInteract, a dataset of over 1.5M multi-view image-language pairs, and DriveMonkey, a framework that connects an LVLM to a spatial processor through learnable queries for autonomous driving tasks.
Findings
DriveMonkey outperforms general LVLMs in 3D visual grounding.
NuInteract supports dense scene captioning and diverse interactive tasks across multiple camera views.
Integration of 3D detectors improves perception accuracy.
Abstract
Large Visual-Language Models (LVLMs) have significantly advanced image understanding. Their comprehension and reasoning capabilities enable promising applications in autonomous driving scenarios. However, existing research typically focuses on front-view perspectives and on a subset of the objects in a scene, and so struggles to achieve comprehensive scene understanding. Meanwhile, existing LVLMs lack a mapping between 2D and 3D and do not sufficiently integrate 3D object localization with instruction understanding. To tackle these limitations, we first introduce NuInteract, a large-scale dataset with over 1.5M multi-view image-language pairs spanning dense scene captions and diverse interactive tasks. Furthermore, we propose DriveMonkey, a simple yet effective framework that seamlessly integrates LVLMs with a spatial processor using a series of learnable queries. The…
