BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving

Felix Brandstaetter; Erik Schuetz; Katharina Winter; Fabian Flohr

arXiv:2507.19370·cs.CV·July 28, 2025

TL;DR

BEV-LLM is a lightweight 3D scene captioning model for autonomous driving that combines LiDAR and multi-view images, achieving competitive results and introducing new datasets for diverse scenario evaluation.

Contribution

The paper presents BEV-LLM, a multimodal 3D captioning model that uses BEVFusion together with a novel absolute positional encoding, and releases two new datasets for comprehensive scene understanding.

Findings

01

BEV-LLM surpasses the state of the art on the nuCaption dataset by up to 5% in BLEU scores.

02

Introduces nuView and GroundView datasets for diverse scenario benchmarking.

03

Demonstrates the effectiveness of combining LiDAR and multi-view images for captioning.
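
For readers unfamiliar with the BLEU metric cited in the findings, the following is a minimal sketch of an unsmoothed, single-reference sentence-level BLEU score. It is not the authors' evaluation code: the function names `ngrams` and `sentence_bleu`, the whitespace tokenization, and the example captions are all assumptions made for illustration, and published results are typically computed at corpus level with standard tooling.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU for whitespace-tokenized strings.

    Illustrative helper only (hypothetical name, not from the paper).
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages captions shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical captions, invented for this example.
reference = "a car is parked on the left"
print(sentence_bleu("a car is parked on the right", reference))
```

A caption identical to the reference scores 1.0, and a caption sharing no words with it scores 0.0; partial overlaps fall in between, which is the scale on which the reported improvement is measured.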

Abstract

Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B-parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing the state of the art by up to 5% in BLEU scores. Additionally, we release two new datasets: nuView (focused on environmental conditions and viewpoints) and…
