SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

Chris Choy; Junha Lee; Chunghyun Park; Minsu Cho; Jan Kautz

arXiv:2604.20395 · cs.CV · April 23, 2026


TL;DR

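SpaCeFormer is a proposal-free, open-vocabulary 3D instance segmentation method that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines, while reporting higher zero-shot mAP than prior methods. This speed targets real-time applications in robotics and AR/VR.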

Contribution

It introduces SpaCeFormer, a proposal-free space-curve transformer that predicts 3D instance masks directly without external region proposals. It also releases SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset.

Findings

01

Runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines.

02

The SpaCeFormer-3M data pipeline achieves 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5).

03

Surpasses all prior methods on multiple datasets in zero-shot mAP.
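The recall figure above is measured at an IoU threshold of 0.5. Below is a minimal sketch of that metric. It assumes recall means the fraction of ground-truth instances covered by at least one predicted mask with IoU > 0.5. The paper may use one-to-one matching instead, and the function names are hypothetical. The last line checks the quoted ratio: 54.3 / 2.5 ≈ 21.7, which the summary rounds to 21x.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Pairwise IoU between boolean per-point instance masks.

    pred: (P, N) bool array, one row per predicted instance over N points.
    gt:   (G, N) bool array, one row per ground-truth instance.
    Returns a (P, G) float array of IoU values.
    """
    p = pred.astype(np.float64)
    g = gt.astype(np.float64)
    inter = p @ g.T                                   # |pred ∩ gt| for every pair
    union = p.sum(1)[:, None] + g.sum(1)[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


def mask_recall(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5) -> float:
    """Fraction of ground-truth instances hit by some prediction with IoU > thresh.

    Hypothetical helper: uses best-match coverage, not one-to-one matching.
    """
    if gt.shape[0] == 0:
        return float("nan")  # recall is undefined without ground truth
    if pred.shape[0] == 0:
        return 0.0
    best = mask_iou(pred, gt).max(axis=0)             # best IoU per GT instance
    return float((best > thresh).mean())


# Toy scene: 6 points, 2 ground-truth instances, 2 predicted masks.
gt = np.array([[1, 1, 1, 0, 0, 0],
               [0, 0, 0, 1, 1, 1]], dtype=bool)
pred = np.array([[1, 1, 0, 0, 0, 0],    # IoU 2/3 with the first instance
                 [0, 0, 0, 0, 0, 1]],   # IoU 1/3 with the second instance
                dtype=bool)
print(mask_recall(pred, gt))            # 0.5

# Ratio behind the "21x" figure: 54.3% vs 2.5% recall at IoU > 0.5.
print(round(54.3 / 2.5, 2))             # 21.72
```

In the toy scene only the first instance clears the 0.5 threshold, so recall is 1 of 2 instances.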

Abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with…
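The abstract describes a space-curve transformer that combines spatial window attention with further components. The sentence is truncated here. Below is a minimal sketch of the general serialized-attention idea, under these assumptions:

- Points are voxelized and ordered along a Z-order (Morton) curve. This curve is chosen only for illustration, because the excerpt does not name the one SpaCeFormer uses.
- The ordered sequence is split into fixed-size windows, and softmax attention runs inside each window.

The voxel size, window size, and function names are hypothetical. The layer also omits learned projections, multiple heads, and whatever the truncated abstract adds.

```python
import numpy as np


def morton_code(grid: np.ndarray, bits: int = 10) -> np.ndarray:
    """Z-order key for non-negative integer voxel coords (N, 3) by bit interleaving."""
    g = grid.astype(np.uint64)
    code = np.zeros(len(g), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            bit = (g[:, axis] >> np.uint64(b)) & np.uint64(1)
            code |= bit << np.uint64(3 * b + axis)    # x, y, z bits interleaved
    return code


def serialize(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Return the permutation that orders points along the (assumed) Z-order curve."""
    grid = np.floor((points - points.min(axis=0)) / voxel_size).astype(np.int64)
    return np.argsort(morton_code(grid), kind="stable")


def window_attention(feats: np.ndarray, order: np.ndarray, window: int) -> np.ndarray:
    """Single-head softmax self-attention within consecutive curve windows.

    Identity Q/K/V projections keep the sketch short; a real layer learns them.
    """
    d = feats.shape[1]
    out = np.empty_like(feats)
    for start in range(0, len(order), window):
        idx = order[start:start + window]             # points in this window
        x = feats[idx]
        scores = x @ x.T / np.sqrt(d)
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)
        out[idx] = attn @ x
    return out


rng = np.random.default_rng(0)
points = rng.uniform(0.0, 2.0, size=(100, 3))        # hypothetical toy scene
feats = rng.normal(size=(100, 16))
order = serialize(points, voxel_size=0.05)
out = window_attention(feats, order, window=32)
print(out.shape)                                      # (100, 16)
```

Sorting along a space-filling curve keeps most spatially nearby points adjacent in the 1D sequence. Fixed-size chunks therefore approximate local neighborhoods without a k-nearest-neighbor search, and each window is a dense block that batches well on a GPU.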
