OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu

TL;DR
OpenTrack3D introduces a mesh-free, online proposal generation framework for open-vocabulary 3D instance segmentation, improving generalization and reasoning in unstructured environments using multi-modal large language models.
Contribution
OpenTrack3D proposes a novel visual-spatial tracker that builds object proposals online, and replaces the CLIP-based classifier with a multi-modal large language model to improve accuracy and reasoning in open-vocabulary 3D instance segmentation.
Findings
Achieves state-of-the-art results on ScanNet200, Replica, ScanNet++, and SceneFun3D.
Demonstrates strong generalization to diverse, unstructured environments.
Enhances compositional reasoning with a multi-modal large language model.
Abstract
Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D…
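The abstract is cut off before it describes the tracker in detail, so the following is only a generic illustration of what "constructing cross-view consistent proposals online" can look like, not OpenTrack3D's actual visual-spatial tracker. It assumes each frame's 2D instance masks have already been back-projected to 3D points using depth and camera pose, and it associates them with existing tracks purely by voxel overlap, ignoring visual cues. All names (`Track`, `update_tracks`, `VOXEL`, `iou_thresh`) and the threshold values are hypothetical.

```python
from dataclasses import dataclass, field

VOXEL = 0.05  # hypothetical voxel size in metres


def voxelize(points, voxel=VOXEL):
    """Map an iterable of (x, y, z) points to a set of integer voxel keys."""
    return {(int(x // voxel), int(y // voxel), int(z // voxel)) for x, y, z in points}


@dataclass
class Track:
    """One accumulated 3D object proposal (hypothetical structure)."""
    track_id: int
    voxels: set = field(default_factory=set)


def iou(a, b):
    """Intersection over union of two voxel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def update_tracks(tracks, frame_instances, iou_thresh=0.25):
    """Greedily merge each back-projected instance into the best-overlapping
    track, or start a new track when no overlap reaches the threshold."""
    for points in frame_instances:
        vox = voxelize(points)
        best = max(tracks, key=lambda t: iou(t.voxels, vox), default=None)
        if best is not None and iou(best.voxels, vox) >= iou_thresh:
            best.voxels |= vox  # same object seen from another view
        else:
            tracks.append(Track(track_id=len(tracks), voxels=vox))
    return tracks
```

Calling `update_tracks` once per incoming frame grows the proposal set incrementally, which is what makes such a scheme usable without a pre-built mesh or a dataset-specific proposal network. A real system would likely combine this spatial test with appearance features and handle splits and merges, which this sketch does not attempt.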
