Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes
Jeongho Noh, Tai Hyoung Rhee, Eunho Lee, Jeongyun Kim, Sunwoo Lee, Ayoung Kim

TL;DR
Clutt3R-Seg is a novel zero-shot 3D instance segmentation method that leverages hierarchical semantic cues and open-vocabulary embeddings to improve robotic grasping in cluttered scenes with sparse views.
Contribution
It introduces a hierarchical instance tree that treats noisy masks as informative semantic cues rather than refining them directly, together with a consistency-aware update for scene changes, enabling robust, view-consistent 3D segmentation in cluttered environments.
Findings
Outperforms state-of-the-art baselines in cluttered scenes.
Achieves AP@25 of 61.66 in heavy clutter, over 2.2x higher than baselines.
Surpasses MaskClustering by over 2x while using fewer views.
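For readers unfamiliar with the AP@25 metric quoted above: under the usual convention, a predicted instance counts as a match when its intersection-over-union (IoU) with a ground-truth instance is at least 0.25. The sketch below illustrates only that matching criterion, assuming instances are represented as sets of point indices; the function names are hypothetical and this is not the paper's evaluation code.

```python
def instance_iou(pred_points, gt_points):
    """IoU between two instances given as collections of point indices."""
    pred, gt = set(pred_points), set(gt_points)
    union = pred | gt
    if not union:
        # Two empty instances: define IoU as 0 to avoid division by zero.
        return 0.0
    return len(pred & gt) / len(union)


def is_match(pred_points, gt_points, threshold=0.25):
    """True if the prediction overlaps the ground truth at the given IoU threshold."""
    return instance_iou(pred_points, gt_points) >= threshold


# Toy example (made-up indices): 2 shared points out of 6 total gives IoU = 1/3.
partial_overlap = is_match({1, 2, 3, 4}, {3, 4, 5, 6})
# 1 shared point out of 9 total gives IoU = 1/9, below the 0.25 threshold.
weak_overlap = is_match({1, 2, 3, 4, 5, 6, 7}, {7, 8, 9})
```

A full AP computation would additionally rank predictions by confidence and integrate precision over recall; that part is omitted here.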
Abstract
Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes…
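The abstract states that each instance carries an open-vocabulary embedding used to pick a target from a natural language instruction. One common way to do this kind of selection is nearest-neighbour matching by cosine similarity, sketched below. This is an illustrative assumption, not the authors' confirmed procedure: the function names, the instance ids, and the toy 3-dimensional vectors are all hypothetical, and in practice the embeddings would come from a vision-language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def select_target(instance_embeddings, query_embedding):
    """Return the id of the instance whose embedding best matches the query."""
    if not instance_embeddings:
        raise ValueError("no instances to select from")
    return max(
        instance_embeddings,
        key=lambda name: cosine_similarity(instance_embeddings[name], query_embedding),
    )


# Hypothetical per-instance embeddings and a hypothetical instruction embedding.
instances = {
    "instance_0": [0.9, 0.1, 0.0],
    "instance_1": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]
target = select_target(instances, query)
```

In this toy setup `target` is `"instance_0"`, since its embedding points in nearly the same direction as the query.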
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
