ISCUTE: Instance Segmentation of Cables Using Text Embedding

Shir Kozlovsky; Omkar Joglekar; Dotan Di Castro

arXiv:2402.11996·cs.CV·February 28, 2024·1 cites

Open Access · 3 Reviews

TL;DR

This paper introduces ISCUTE, a text-promptable method for instance segmentation of deformable linear objects (DLOs) such as cables. It combines the CLIPSeg and SAM models, reports state-of-the-art results, and introduces a new DLO-specific dataset.

Contribution

The paper presents a new text-promptable segmentation method for DLOs, integrating CLIPSeg and SAM, and provides a dedicated dataset for this task.

Findings

01

Achieves a mIoU of 91.21% on DLO segmentation

02

Outperforms previous state-of-the-art methods

03

Provides a new dataset for DLO instance segmentation
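
The mIoU figure in the findings is a mean intersection-over-union score. The following is a minimal sketch of how such a score can be computed, assuming predicted and ground-truth instance masks have already been paired one-to-one. The paper's exact matching and averaging protocol is not given on this page, and the names `mask_iou` and `mean_iou` are illustrative, not taken from the paper.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection-over-union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection) / float(union)


def mean_iou(pairs):
    """Average IoU over an iterable of (predicted, ground-truth) mask pairs."""
    ious = [mask_iou(pred, target) for pred, target in pairs]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    gt = np.array([[1, 0], [0, 0]])
    over_segmented = np.array([[1, 1], [0, 0]])  # one extra pixel
    print(mean_iou([(over_segmented, gt), (gt, gt)]))
```

Under these assumptions, a score of 91.21% would mean that, on average, each predicted cable mask overlaps its ground-truth mask on about 91% of their combined area.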

Abstract

In the field of robotics and automation, conventional object recognition and instance segmentation methods face a formidable challenge when it comes to perceiving Deformable Linear Objects (DLOs) like wires, cables, and flexible tubes. This challenge arises primarily from the lack of distinct attributes such as shape, color, and texture, which calls for tailored solutions to achieve precise identification. In this work, we propose a foundation model-based DLO instance segmentation technique that is text-promptable and user-friendly. Specifically, our approach combines the text-conditioned semantic segmentation capabilities of the CLIPSeg model with the zero-shot generalization capabilities of the Segment Anything Model (SAM). We show that our method exceeds SOTA performance on DLO instance segmentation, achieving a mIoU of $91.21\%$. We also introduce a rich and diverse DLO-specific dataset for…
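
The abstract describes turning a text-conditioned segmentation output into input for a promptable segmenter. As a rough illustration of that hand-off only, the sketch below picks a few well-separated, high-scoring pixels from a text-conditioned heatmap and returns them as (x, y) point prompts. This is an assumption-laden stand-in: the reviews indicate the paper uses a learned prompt encoder for this step, whereas the greedy peak picking here is a simple substitute, and `sample_point_prompts` with its parameters is a hypothetical name.

```python
import numpy as np


def sample_point_prompts(heatmap, num_points=3, min_distance=5, threshold=0.5):
    """Greedily pick up to num_points spatially separated peaks as (x, y) points.

    heatmap: 2D array of per-pixel scores, e.g. from a text-conditioned
    semantic segmentation model (assumed to lie roughly in [0, 1]).
    """
    scores = np.array(heatmap, dtype=float)  # work on a copy
    height, width = scores.shape
    points = []
    for _ in range(num_points):
        flat_index = int(np.argmax(scores))
        y, x = divmod(flat_index, width)
        if scores[y, x] < threshold:
            break  # no confident pixels left
        points.append((x, y))
        # Suppress a square window around the chosen peak so the next
        # point lands somewhere else.
        y0, y1 = max(0, y - min_distance), min(height, y + min_distance + 1)
        x0, x1 = max(0, x - min_distance), min(width, x + min_distance + 1)
        scores[y0:y1, x0:x1] = -np.inf
    return points


if __name__ == "__main__":
    demo = np.zeros((20, 20))
    demo[2, 3] = 0.9    # strongest response
    demo[2, 4] = 0.85   # neighbor of the strongest peak, gets suppressed
    demo[15, 10] = 0.8  # a second, distant response
    print(sample_point_prompts(demo, num_points=3, min_distance=3))
```

Points in (x, y) pixel order could then be passed to a point-promptable segmenter such as SAM. A fixed heuristic like this cannot separate individual cables the way a trained prompt encoder might, so it should be read as a picture of the data flow rather than of the method itself.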

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6: marginally above the acceptance threshold · Confidence 4

Strengths

The proposed approach is relatively straightforward, as it is a direct application of the CLIPSeg and SAM models to a highly specific domain. The main idea of adding adapters is technically sound and aligns well with the problem setting, since well-labeled cable images are not easy to acquire in practice. This validates the choice of using adapters in this approach instead of full fine-tuning. In this sense, the proposed approach is reasonably motivated. In addition, adding text prompt to the model a

Weaknesses

This work is good overall and has no significant flaws, but I do want to mention that it is more an application of existing models to a new domain, with some modifications, than a novel approach. The way these models are used is relatively straightforward. There are still a few questions to be answered; please see the detailed comments below.

Reviewer 02 · Rating 6: marginally above the acceptance threshold · Confidence 3

Strengths

(1) The proposed method combines SAM with text conditions and constructs a prompt encoder to help improve overall DLO segmentation ability. (2) The proposed method achieves state-of-the-art performance compared with other recent algorithms on DLO instance segmentation.

Weaknesses

(1) It seems that the proposed method relies on the assumption that, if properly prompted, SAM can always provide correct cable segmentation masks. As the authors claim, the performance upper bound is limited by SAM and CLIPSeg. What is the exact upper bound of these two methods, and how close does the proposed method come to it? (2) Run-time for each method is not evaluated or analyzed in Tables 1 and 2. (3) Typo in Section 2.2: “ ... that have historically have been difficult to se

Reviewer 03 · Rating 3: reject, not good enough · Confidence 5

Strengths

- Proposed a prompt encoder network to obtain point prompts from text prompts by CLIPSeg.
- Proposed a binary classifier network for the quality of SAM-generated masks.
- Achieved a solid result.

Weaknesses

- Utilizing the combination of two powerful models, CLIPSeg and SAM, may be effective but is not novel.
- The design motivation in Section 3.1.2 (i.e., MLP, cross-attention, self-attention) is missing.
- Few baselines. The only other method mentioned is RT-DLO. Considering that the authors leverage strong semantic segmentation methods, including SAM, they should compare their method with those segmentation methods.
- No ablation studies were conducted.

Taxonomy

Topics: Infrastructure Maintenance and Monitoring · Natural Language Processing Techniques · Handwritten Text Recognition Techniques