OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Zhishan Zhou; Siyuan Wei; Zengran Wang; Chunjie Wang; Xiaosheng Yan; Xiao Liu

arXiv:2512.03532·cs.CV·May 15, 2026

TL;DR

OpenTrack3D is a mesh-free framework for open-vocabulary 3D instance segmentation. It generates object proposals online with a visual-spatial tracker, which improves generalization to unstructured environments, and it uses a multi-modal large language model in place of CLIP to improve reasoning over user queries.

Contribution

The paper proposes a novel visual-spatial tracker that builds object proposals online, and it replaces CLIP with a multi-modal large language model to improve accuracy and reasoning in 3D instance segmentation.

Findings

01. Achieves state-of-the-art results on ScanNet200, Replica, ScanNet++, and SceneFun3D.

02. Demonstrates strong generalization to diverse, unstructured environments.

03. Enhances compositional reasoning with a multi-modal large language model.

Abstract

Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, which makes these methods inapplicable in mesh-free scenarios and limits generalization to novel scenes; and (2) CLIP-based classifiers have weak textual reasoning and struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D…
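
The abstract is cut off before the tracker is described, so the following is only an illustrative sketch of what "constructing cross-view consistent object proposals online" from an RGB-D stream can look like in general. It is not the authors' visual-spatial tracker. It assumes a pinhole camera, a known camera-to-world pose per frame, and per-frame 2D instance masks from some external segmenter. The function names (`backproject`, `voxel_set`, `associate`), the voxel size, and the IoU threshold are all hypothetical choices made for this example.

```python
import numpy as np

def backproject(depth, mask, fx, fy, cx, cy, pose):
    """Lift masked depth pixels to 3D world points (pinhole model).

    pose is an assumed 4x4 camera-to-world matrix.
    """
    valid = mask.astype(bool) & (depth > 0)
    v, u = np.nonzero(valid)          # pixel rows and columns
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z], axis=1)
    return pts_cam @ pose[:3, :3].T + pose[:3, 3]

def voxel_set(points, voxel_size=0.05):
    """Quantize points into a set of integer voxel coordinates."""
    idx = np.floor(points / voxel_size).astype(int)
    return set(map(tuple, idx))

def associate(tracks, new_voxels, iou_thresh=0.25):
    """Merge new_voxels into the best-overlapping track or start a new one.

    tracks is a list of voxel sets, one per 3D proposal; it is updated
    in place. Returns the index of the track that received the voxels.
    """
    best_idx, best_iou = -1, 0.0
    for i, track in enumerate(tracks):
        union = len(track | new_voxels)
        iou = len(track & new_voxels) / union if union else 0.0
        if iou > best_iou:
            best_idx, best_iou = i, iou
    if best_idx >= 0 and best_iou >= iou_thresh:
        tracks[best_idx] |= new_voxels
        return best_idx
    tracks.append(set(new_voxels))
    return len(tracks) - 1
```

In this sketch, each incoming frame contributes one voxel set per 2D mask, and `associate` is called once per mask so that proposals grow as new views arrive. Purely spatial overlap is the simplest possible association cue; a tracker that also uses visual features, as the name "visual-spatial" suggests, would need an additional appearance term that is not modeled here.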
