BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving
Felix Brandstaetter, Erik Schuetz, Katharina Winter, Fabian Flohr

TL;DR
BEV-LLM is a lightweight 3D scene captioning model for autonomous driving that combines LiDAR and multi-view images, achieving competitive results and introducing new datasets for diverse scenario evaluation.
Contribution
The paper presents BEV-LLM, a novel multimodal 3D captioning model leveraging BEVFusion and positional encoding, along with two new datasets for comprehensive scene understanding.
Findings
BEV-LLM surpasses state-of-the-art BLEU scores on the nuCaption dataset by up to 5%.
Introduces the nuView and GroundView datasets for benchmarking across diverse scenarios.
Demonstrates the effectiveness of combining LiDAR and multi-view images for scene captioning.
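For readers unfamiliar with the BLEU metric behind the headline result, the following is a minimal sketch of sentence-level BLEU against a single reference caption: clipped n-gram precision combined by a geometric mean, then scaled by a brevity penalty. It is an illustration only, not the paper's evaluation code. This summary does not specify the exact BLEU variant, tokenization, or smoothing the authors used, so whitespace tokenization, a single reference, and no smoothing are assumptions here, and the function names `ngrams` and `bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing.

    Assumption: lowercase whitespace tokenization; real evaluation
    toolkits usually apply their own tokenizer.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical captions, invented purely for illustration.
    generated = "a white truck is parked on the right side of the road"
    ground_truth = "a white truck is parked on the left side of the road"
    print(f"BLEU-4: {bleu(generated, ground_truth):.3f}")
```

Because every n-gram order must have at least one match, short captions often score zero without smoothing; corpus-level BLEU, which aggregates counts over all captions before taking the ratio, avoids much of that instability.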
Abstract
Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing state-of-the-art by up to 5% in BLEU scores. Additionally, we release two new datasets - nuView (focused on environmental conditions and viewpoints) and…
