OG: Equip vision occupancy with instance segmentation and visual grounding
Zichao Dong, Hang Ji, Weikun Zhang, Xufeng Huang, Junbo Chen

TL;DR
This paper introduces Occupancy Grounding (OG), a novel method that enhances 3D occupancy prediction with instance segmentation and visual grounding capabilities, enabling more detailed and grounded perception in 3D scenes.
Contribution
OG is the first approach to integrate instance segmentation and visual grounding into voxel-based occupancy prediction, addressing key limitations of previous semantic segmentation methods.
Findings
Achieved effective 3D instance segmentation in occupancy maps.
Demonstrated successful visual grounding in voxel space.
Validated approach through extensive experiments and visualizations.
Abstract
Occupancy prediction infers both geometry and a semantic label for each voxel, which makes it an important perception task. However, it remains a semantic segmentation task that does not distinguish individual instances. Further, although some existing works, such as Open-Vocabulary Occupancy (OVO), have already addressed open-vocabulary detection, visual grounding in occupancy has not been solved, to the best of our knowledge. To tackle these two limitations, this paper proposes Occupancy Grounding (OG), a novel method that equips vanilla occupancy with instance segmentation ability and can perform visual grounding at the voxel level with the help of Grounded-SAM. The keys to our approach are (1) affinity field prediction for instance clustering and (2) an association strategy for aligning 2D instance masks with 3D occupancy instances. Extensive experiments have been…
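The two key ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how affinities are defined or how the association is scored, so everything below is an assumption made for illustration. Here the affinity field is assumed to store one score per voxel and axis (similarity to the +1 neighbour), clustering is done with a plain union-find, and association is a majority vote of projected voxel centres over a 2D mask image (a stand-in for masks produced by a model such as Grounded-SAM). The function names, the threshold, and the pinhole camera model are all hypothetical.

```python
import numpy as np

def cluster_by_affinity(sem, aff, thresh=0.5):
    """Assign instance ids by merging neighbouring voxels with high affinity.

    sem: (X, Y, Z) int array of semantic labels, 0 = empty (assumed convention).
    aff: (3, X, Y, Z) float array; aff[a, x, y, z] is the assumed affinity
         between voxel (x, y, z) and its +1 neighbour along axis a.
    """
    shape = sem.shape
    parent = np.arange(sem.size)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    occupied = np.argwhere(sem > 0)
    for axis in range(3):
        for idx in occupied:
            nb = idx.copy()
            nb[axis] += 1
            if nb[axis] >= shape[axis]:
                continue  # neighbour is outside the grid
            if sem[tuple(nb)] != sem[tuple(idx)]:
                continue  # only merge voxels of the same semantic class
            if aff[(axis, *idx)] <= thresh:
                continue  # affinity too low to merge
            a = find(np.ravel_multi_index(tuple(idx), shape))
            b = find(np.ravel_multi_index(tuple(nb), shape))
            parent[max(a, b)] = min(a, b)

    # Relabel union-find roots as consecutive instance ids starting at 1.
    inst = np.zeros(shape, dtype=np.int64)
    roots = {}
    for idx in occupied:
        r = find(np.ravel_multi_index(tuple(idx), shape))
        inst[tuple(idx)] = roots.setdefault(r, len(roots) + 1)
    return inst

def associate_masks(inst, mask2d, K, voxel_size=1.0, origin=(0.0, 0.0, 0.0)):
    """Map each 3D instance id to the 2D mask id most of its voxels project into.

    mask2d: (H, W) int array of 2D instance ids, 0 = background.
    K: (3, 3) pinhole intrinsics. Voxel centres are assumed to already be
       expressed in the camera frame (no extrinsics are applied here).
    """
    H, W = mask2d.shape
    match = {}
    for iid in np.unique(inst[inst > 0]):
        centres = (np.argwhere(inst == iid) + 0.5) * voxel_size + np.asarray(origin)
        centres = centres[centres[:, 2] > 0]  # keep points in front of the camera
        if len(centres) == 0:
            continue
        uvw = centres @ K.T
        u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
        ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        votes = mask2d[v[ok], u[ok]]
        votes = votes[votes > 0]  # ignore background pixels
        if votes.size:
            match[int(iid)] = int(np.bincount(votes).argmax())
    return match
```

Once 3D instances are matched to 2D masks, a text query grounded to a 2D mask could be carried over to the matched voxels. A single-camera majority vote is the simplest choice; a multi-camera setup would need per-camera extrinsics and some way to merge votes, which this sketch leaves out.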
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
