Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes

Jeongho Noh; Tai Hyoung Rhee; Eunho Lee; Jeongyun Kim; Sunwoo Lee; Ayoung Kim

arXiv:2602.11660 · cs.CV · February 13, 2026

Open Access

TL;DR

Clutt3R-Seg is a novel zero-shot 3D instance segmentation method that leverages hierarchical semantic cues and open-vocabulary embeddings to improve robotic grasping in cluttered scenes with sparse views.

Contribution

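It introduces a hierarchical instance tree that treats noisy masks as semantic cues rather than discarding them: cross-view grouping and conditional substitution suppress over- and under-segmentation, and a consistency-aware update handles scene changes, enabling robust, view-consistent 3D segmentation in cluttered environments.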

Findings

01. Outperforms state-of-the-art baselines in cluttered scenes.
02. Achieves AP@25 of 61.66 in heavy clutter, over 2.2x higher than baselines.
03. Using fewer views, surpasses MaskClustering by over 2x.
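
For readers unfamiliar with the metric, AP@25 counts a predicted instance as correct when its IoU with a ground-truth instance reaches 0.25. The sketch below illustrates only that matching criterion, assuming instances are represented as sets of point indices; the function names are hypothetical, and the full AP computation (ranking predictions by confidence and integrating precision over recall) is omitted. Whether the comparison is strict or inclusive varies by benchmark; an inclusive test is assumed here.

```python
def iou(pred_points, gt_points):
    """Intersection over union of two instances given as point-index collections."""
    pred, gt = set(pred_points), set(gt_points)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


def is_match(pred_points, gt_points, threshold=0.25):
    """Assumed matching rule: a prediction matches when IoU >= threshold."""
    return iou(pred_points, gt_points) >= threshold


# Toy example: ground truth covers points 0..99, prediction covers 60..119.
gt = range(0, 100)
pred = range(60, 120)

overlap = iou(pred, gt)                        # 40 shared points / 120 in the union
match_at_25 = is_match(pred, gt, threshold=0.25)
match_at_50 = is_match(pred, gt, threshold=0.50)
print(overlap, match_at_25, match_at_50)
```

The same prediction counts as a match at the 0.25 threshold but not at 0.50, which is why AP@25 is the more lenient of the two settings.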

Abstract

Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes…
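
The target-selection step described in the abstract can be illustrated with a minimal sketch, assuming each 3D instance carries one embedding vector and the instruction is embedded into the same space. The selection rule used here (highest cosine similarity) is a common choice and an assumption, not a detail confirmed by this page; `select_target`, the instance ids, and the 3-dimensional toy vectors are all hypothetical stand-ins for the output of a real open-vocabulary encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_target(query_embedding, instance_embeddings):
    """Return (instance_id, score) for the instance most similar to the query."""
    best_id, best_score = None, -math.inf
    for instance_id, embedding in instance_embeddings.items():
        score = cosine(query_embedding, embedding)
        if score > best_score:
            best_id, best_score = instance_id, score
    return best_id, best_score


# Hypothetical per-instance embeddings (toy values, not from the paper).
instances = {
    "instance_0": [0.9, 0.1, 0.0],
    "instance_1": [0.1, 0.9, 0.1],
    "instance_2": [0.0, 0.2, 0.9],
}
# Hypothetical embedding of a natural-language instruction.
query = [0.2, 0.8, 0.0]

target_id, score = select_target(query, instances)
print(target_id, round(score, 3))
```

In a real pipeline the selected instance id would index the segmented 3D points handed to a grasp planner; that hand-off is outside the scope of this sketch.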

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications