VEON: Vocabulary-Enhanced Occupancy Prediction
Jilai Zheng, Pin Tang, Zhongdao Wang, Guoqing Wang, Xiangxuan Ren, Bailan Feng, Chao Ma

TL;DR
VEON is a novel framework that combines and adapts foundation models like MiDaS and CLIP to predict 3D occupancy with open-vocabulary semantics, addressing depth ambiguity, low-resolution features, and long-tail class issues.
Contribution
It introduces a method for adapting existing foundation models to open-vocabulary 3D occupancy prediction with few trainable parameters and no manual semantic labels.
Findings
Achieves 15.14 mIoU on the Occ3D-nuScenes dataset.
Demonstrates capability for open-vocabulary object recognition.
Uses only 46M trainable parameters.
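For readers unfamiliar with the mIoU metric cited above, the following is a minimal sketch of how mean intersection-over-union is commonly computed over flattened voxel labels. It is not taken from the paper or from the Occ3D-nuScenes evaluation code; the function name `mean_iou` and the `ignore_label` parameter are hypothetical, and the benchmark's own protocol (for example, visibility masks) may differ. The result is a fraction in [0, 1]; benchmark tables usually report it as a percentage.

```python
def mean_iou(pred, target, num_classes, ignore_label=None):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class ids (flattened voxels).
    ignore_label: hypothetical optional id for voxels excluded from scoring.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # skip voxels that should not be scored
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # a mismatch enlarges the union of both classes involved
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=3)
    print(f"mIoU = {100 * score:.2f}%")
```

Classes that appear in neither the prediction nor the ground truth are left out of the average here; some evaluation scripts count them differently, so scores are only comparable under the same protocol.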
Abstract
Perceiving the world as 3D occupancy helps embodied agents avoid collisions with obstacles of any type. While open-vocabulary image understanding has prospered recently, how to bind predicted 3D occupancy grids to open-world semantics remains under-explored because open-world annotations are limited. Hence, instead of building our model from scratch, we blend 2D foundation models, specifically the depth model MiDaS and the semantic model CLIP, to lift semantics into 3D space and thereby achieve 3D occupancy prediction. However, building upon these foundation models is not trivial. First, MiDaS faces the depth ambiguity problem, i.e., it produces only relative depth and cannot estimate the bin depth needed for feature lifting. Second, CLIP image features lack high-resolution pixel-level information, which limits 3D occupancy accuracy. Third, open vocabulary is often trapped by the…
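To make "bin depth for feature lifting" concrete, here is a rough sketch of one common lifting scheme: each pixel's feature vector is spread along its camera ray, weighted by a per-pixel probability distribution over discrete depth bins. This is an illustrative assumption, not VEON's actual implementation; the function names, the uniform bin layout, and the values of `d_min`, `d_max`, and `num_bins` are all hypothetical. It also shows why relative depth alone is insufficient: assigning a pixel to a metric bin requires depth in meters.

```python
import numpy as np


def depth_to_bin_index(depth_m, d_min=1.0, d_max=45.0, num_bins=88):
    """Map metric depth in meters to a uniform bin index (hypothetical layout)."""
    width = (d_max - d_min) / num_bins
    idx = np.floor((np.asarray(depth_m, dtype=float) - d_min) / width).astype(int)
    return np.clip(idx, 0, num_bins - 1)  # clamp out-of-range depths


def lift_features(feat, depth_logits):
    """Spread 2D features along depth bins.

    feat:         (C, H, W) image features.
    depth_logits: (D, H, W) unnormalized per-pixel scores over D depth bins.
    returns:      (C, D, H, W) frustum features.
    """
    z = depth_logits - depth_logits.max(axis=0, keepdims=True)  # stable softmax
    prob = np.exp(z)
    prob /= prob.sum(axis=0, keepdims=True)
    # per-pixel outer product of the feature vector and the bin distribution
    return feat[:, None, :, :] * prob[None, :, :, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.normal(size=(4, 2, 3))
    logits = rng.normal(size=(5, 2, 3))
    print(lift_features(feat, logits).shape)
    print(depth_to_bin_index([1.0, 23.0, 45.0]))
```

Because the bin probabilities sum to one at every pixel, summing the lifted tensor over the depth axis recovers the original 2D features; the frustum features would then be pooled into a voxel grid using the camera geometry, a step omitted here.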
Taxonomy
Topics: Web Data Mining and Analysis · Data Management and Algorithms
Methods: Contrastive Language-Image Pre-training
