VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving
Levente Tempfli, Esteban Rivera, Markus Lienkamp

TL;DR
VESPA introduces a multimodal autolabeling pipeline that combines LiDAR and camera data with vision-language models to enable open-vocabulary, high-quality 3D labeling for autonomous driving, reducing reliance on manual annotation.
Contribution
The paper presents VESPA, a novel autolabeling method that fuses LiDAR and camera data with vision-language models for open-vocabulary 3D object detection without ground-truth annotations.
Findings
Achieves 52.95% AP for object discovery on nuScenes
Reaches up to 46.54% AP for multiclass detection
Supports discovery of novel categories in 3D scenes
Abstract
Data collection for autonomous driving is rapidly accelerating, but manual annotation, especially for 3D labels, remains a major bottleneck due to its high cost and labor intensity. Autolabeling has emerged as a scalable alternative, allowing the generation of labels for point clouds with minimal human intervention. While LiDAR-based autolabeling methods leverage geometric information, they struggle with inherent limitations of LiDAR data, such as sparsity, occlusions, and incomplete object observations. Furthermore, these methods typically operate in a class-agnostic manner, offering limited semantic granularity. To address these challenges, we introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. Our approach leverages vision-language models (VLMs) to enable open-vocabulary object labeling and to…
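The abstract describes fusing LiDAR geometry with camera semantics but does not spell out the mechanics here. As a rough illustration only, and not the VESPA pipeline itself, the sketch below shows one common way such fusion is set up: project each LiDAR point into the image with a pinhole camera model and give it the label of any 2D detection box it falls inside. Everything in it is an assumption made for illustration: the function names `project_point` and `label_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, the box format, and the premise that the points are already expressed in the camera frame with z pointing forward.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]
# Hypothetical 2D detection format: (u_min, v_min, u_max, v_max, label)
Box2D = Tuple[float, float, float, float, str]


def project_point(
    p: Point3D, fx: float, fy: float, cx: float, cy: float
) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a point given in the camera frame (z forward).

    Returns pixel coordinates (u, v), or None if the point lies behind the camera.
    """
    x, y, z = p
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def label_points(
    points: List[Point3D],
    boxes: List[Box2D],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Optional[str]]:
    """Assign each point the label of the first 2D box containing its projection."""
    labels: List[Optional[str]] = []
    for p in points:
        uv = project_point(p, fx, fy, cx, cy)
        label: Optional[str] = None
        if uv is not None:
            u, v = uv
            for u0, v0, u1, v1, name in boxes:
                if u0 <= u <= u1 and v0 <= v <= v1:
                    label = name
                    break
        labels.append(label)
    return labels


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper or from nuScenes.
    pts = [(0.0, 0.0, 10.0), (2.0, 0.0, 10.0), (0.0, 0.0, -5.0)]
    dets = [(600.0, 300.0, 700.0, 400.0, "car")]
    print(label_points(pts, dets, 1000.0, 1000.0, 640.0, 360.0))
```

In a real setup the 2D boxes and their open-vocabulary labels would come from an image model, and the points would first be transformed from the LiDAR frame to the camera frame using the sensor calibration; both steps are omitted here.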
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
