ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models

Ying Zhang, Maoliang Yin, Wenfu Bi, Haibao Yan, Shaohan Bian, Cui-Hua Zhang, Changchun Hua

arXiv:2502.03266 · cs.CV · February 6, 2025

Open Access

TL;DR

This paper introduces ZISVFM, a zero-shot object instance segmentation method for indoor robots that combines SAM and a self-supervised ViT to accurately segment unknown objects without extensive training.

Contribution

The framework leverages the zero-shot capability of SAM and features from a self-supervised vision transformer to improve unseen object segmentation in robotic environments.

Findings

01 Outperforms existing UOIS methods on benchmark datasets

02 Effective in complex hierarchical environments like cabinets and drawers

03 Operates without extensive annotated training data

Abstract

Service robots operating in unstructured environments must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learning-based segmentation techniques require extensive annotated datasets, which are impractical for the diversity of objects encountered in real-world scenarios. Unseen Object Instance Segmentation (UOIS) methods aim to address this by training models on synthetic data to generalize to novel objects, but they often suffer from the simulation-to-reality gap. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the Segment Anything Model (SAM) and explicit visual representations from a self-supervised vision transformer (ViT). The proposed framework operates in three stages: (1) generating object-agnostic mask proposals from colorized depth images using SAM, (2)…
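
As a rough illustration of what "colorized depth images" in stage (1) could look like in practice, the sketch below maps a single-channel depth map to a three-channel RGB image that an RGB-only model such as SAM could consume. This is not taken from the paper: the function name `colorize_depth`, the jet-like color ramp, and the treatment of non-positive or non-finite depth as invalid are all assumptions made for illustration, and the authors' actual colorization scheme may differ.

```python
import numpy as np


def colorize_depth(depth, d_min=None, d_max=None):
    """Map an (H, W) depth map to an (H, W, 3) uint8 RGB image.

    Hypothetical helper, not the paper's implementation. Pixels that are
    non-finite or <= 0 are treated as missing and rendered black.
    """
    depth = np.asarray(depth, dtype=np.float64)
    valid = np.isfinite(depth) & (depth > 0)
    rgb = np.zeros(depth.shape + (3,), dtype=np.uint8)
    if not valid.any():
        return rgb

    # Normalize valid depth values to [0, 1] (near = 0, far = 1).
    lo = depth[valid].min() if d_min is None else d_min
    hi = depth[valid].max() if d_max is None else d_max
    span = max(hi - lo, 1e-9)
    safe = np.where(valid, depth, lo)  # keep invalid pixels out of the math
    t = np.clip((safe - lo) / span, 0.0, 1.0)

    # Jet-like piecewise-linear ramp (assumed): blue -> green -> red.
    r = np.clip(1.5 - np.abs(4.0 * t - 3.0), 0.0, 1.0)
    g = np.clip(1.5 - np.abs(4.0 * t - 2.0), 0.0, 1.0)
    b = np.clip(1.5 - np.abs(4.0 * t - 1.0), 0.0, 1.0)
    color = np.stack([r, g, b], axis=-1)

    rgb[valid] = np.round(color[valid] * 255.0).astype(np.uint8)
    return rgb


# Toy 2x2 depth map in meters; 0.0 marks a missing measurement.
toy_depth = np.array([[0.0, 1.0], [2.0, 3.0]])
toy_rgb = colorize_depth(toy_depth)
print(toy_rgb.shape, toy_rgb.dtype)
```

The resulting array has the same layout as an ordinary RGB image, so it could in principle be handed to a mask generator in place of a color photograph. How the paper handles missing depth or chooses the depth range is not stated in the truncated abstract above.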

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Multi-Head Attention · Vision Transformer · Segment Anything Model