FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation
Changyang Li, Xueqing Huang, Shin-Fang Chng, Huangying Zhan, Qingan Yan, Yi Xu

TL;DR
FAST3DIS is an end-to-end 3D instance segmentation method built on a query-based Transformer. It bypasses post-hoc clustering, improves efficiency, and retains the zero-shot geometric priors of its depth backbone.
Contribution
The paper introduces a novel query-based Transformer architecture with 3D anchors and contrastive learning for efficient, accurate 3D instance segmentation without post-hoc clustering.
Findings
Achieves competitive accuracy on indoor 3D datasets.
Offers improved memory scalability and inference speed.
Effectively prevents query collisions with dual-level regularization.
Abstract
While recent feed-forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed "lift-and-cluster" paradigm. Grouping dense pixel-wise embeddings via non-differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end-to-end approach that effectively bypasses post-hoc clustering. We introduce a 3D-anchored, query-based Transformer architecture built upon a foundational depth backbone, adapted efficiently to learn instance-specific semantics while retaining its zero-shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor-sampling cross-attention mechanism…
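To make the contrast with "lift-and-cluster" concrete, here is a minimal sketch of generic query-based mask decoding. Each instance query scores every pixel embedding with a dot product, and a pixel takes the label of its best-scoring query, so no clustering step is needed. This is not the paper's implementation: the function names, the sigmoid scoring, the threshold, and the toy 2-D embeddings are all assumptions for illustration, and the 3D anchor generator and anchor-sampling cross-attention are omitted because the truncated abstract does not specify them.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_masks(queries, pixel_embeddings):
    """Soft mask per query: sigmoid of the query-pixel dot product (assumed scoring)."""
    return [
        [1.0 / (1.0 + math.exp(-dot(q, e))) for e in pixel_embeddings]
        for q in queries
    ]


def assign_instances(masks, threshold=0.5):
    """Label each pixel with its best-scoring query, or -1 if no score clears the threshold."""
    num_pixels = len(masks[0])
    labels = []
    for p in range(num_pixels):
        scores = [mask[p] for mask in masks]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(best if scores[best] >= threshold else -1)
    return labels


# Toy data (hypothetical): two instance queries and four pixel embeddings.
queries = [[4.0, -4.0], [-4.0, 4.0]]
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 0.0]]

masks = query_masks(queries, pixels)
labels = assign_instances(masks, threshold=0.6)
print(labels)  # [0, 0, 1, -1]
```

Because every step is a dot product followed by a differentiable nonlinearity (only the final argmax is discrete, and it is used at inference), a model of this shape can be trained directly against mask targets, which is the property the abstract contrasts with non-differentiable clustering.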
