VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving

Levente Tempfli; Esteban Rivera; Markus Lienkamp

arXiv:2507.20397 · cs.CV · July 29, 2025

Open Access

TL;DR

VESPA introduces a multimodal autolabeling pipeline that combines LiDAR and camera data with vision-language models to enable open-vocabulary, high-quality 3D labeling for autonomous driving, reducing reliance on manual annotation.

Contribution

The paper presents VESPA, a novel autolabeling method that fuses LiDAR and camera data with vision-language models for open-vocabulary 3D object detection without ground-truth annotations.

Findings

01

Achieves 52.95% AP for object discovery on nuScenes

02

Reaches up to 46.54% AP for multiclass detection

03

Supports discovery of novel categories in 3D scenes

Abstract

Data collection for autonomous driving is rapidly accelerating, but manual annotation, especially for 3D labels, remains a major bottleneck due to its high cost and labor intensity. Autolabeling has emerged as a scalable alternative, allowing the generation of labels for point clouds with minimal human intervention. While LiDAR-based autolabeling methods leverage geometric information, they struggle with inherent limitations of LiDAR data, such as sparsity, occlusions, and incomplete object observations. Furthermore, these methods typically operate in a class-agnostic manner, offering limited semantic granularity. To address these challenges, we introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. Our approach leverages vision-language models (VLMs) to enable open-vocabulary object labeling and to…
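
As a rough illustration of what fusing LiDAR geometry with camera semantics can look like, the sketch below projects 3D points into an image with a pinhole camera model and attaches the label of any 2D box that contains the projection. This is a generic technique and not VESPA's actual pipeline, which the abstract does not detail. The function names, the box format, and the assumption that points are already expressed in the camera frame (z pointing forward) are all hypothetical choices made for this example; in practice the 2D boxes and their open-vocabulary labels would come from some image-side model.

```python
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]
# Hypothetical 2D box format: (u_min, v_min, u_max, v_max, label)
Box2D = Tuple[float, float, float, float, str]


def project_to_image(
    points_cam: Sequence[Point3D],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[Optional[Pixel]]:
    """Pinhole projection of points given in the camera frame (assumed, z forward).

    Returns one entry per point: (u, v) in pixels, or None if the point lies
    behind the camera or projects outside the image.
    """
    pixels: List[Optional[Pixel]] = []
    for x, y, z in points_cam:
        if z <= 0.0:  # behind the image plane, cannot be seen
            pixels.append(None)
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        inside = 0.0 <= u < width and 0.0 <= v < height
        pixels.append((u, v) if inside else None)
    return pixels


def label_points(
    points_cam: Sequence[Point3D],
    boxes_2d: Sequence[Box2D],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[Optional[str]]:
    """Give each point the label of the first 2D box containing its projection."""
    labels: List[Optional[str]] = []
    for pix in project_to_image(points_cam, fx, fy, cx, cy, width, height):
        label = None
        if pix is not None:
            u, v = pix
            for u_min, v_min, u_max, v_max, name in boxes_2d:
                if u_min <= u <= u_max and v_min <= v <= v_max:
                    label = name
                    break
        labels.append(label)
    return labels


# Toy usage with made-up intrinsics and a single hypothetical "car" box.
points = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 0.0, -5.0)]
boxes = [(40.0, 40.0, 55.0, 55.0, "car")]
point_labels = label_points(points, boxes, 100.0, 100.0, 50.0, 50.0, 100, 100)
```

Taking the first matching box is a deliberate simplification. A real system would also need the LiDAR-to-camera extrinsic transform and some way to handle occlusion, since points behind an object can project into the same box.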

Taxonomy

Topics: 3D Shape Modeling and Analysis · Advanced Neural Network Applications · Robotics and Sensor-Based Localization