ISCUTE: Instance Segmentation of Cables Using Text Embedding
Shir Kozlovsky, Omkar Joglekar, Dotan Di Castro

TL;DR
This paper introduces ISCUTE, a novel method for segmenting deformable linear objects like cables using text prompts, combining CLIPSeg and SAM models, and achieves state-of-the-art results on a new dataset.
Contribution
The paper presents a new text-promptable segmentation method for DLOs, integrating CLIPSeg and SAM, and provides a dedicated dataset for this task.
Findings
Achieves a mIoU of 91.21% on DLO segmentation
Outperforms previous state-of-the-art methods
Provides a new dataset for DLO instance segmentation
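The headline number above is a mean Intersection-over-Union (mIoU). As a rough illustration of the metric only, here is a minimal sketch that assumes binary masks stored as nested lists and predictions already paired one-to-one with ground-truth masks. The paper's exact evaluation protocol (for example, how predicted instances are matched to ground truth) is not stated here, and the names `mask_iou` and `mean_iou` are hypothetical.

```python
def mask_iou(pred, target):
    """IoU of two same-sized binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average IoU over already-matched (prediction, ground truth) pairs."""
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(mask_iou(pred, target))                        # one shared pixel out of two
    print(mean_iou([pred, target], [target, target]))    # averages both pairs
```

Under this definition, a reported mIoU of 91.21% would mean that, on average, predicted and ground-truth masks overlap on about 91% of their combined area.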
Abstract
In the field of robotics and automation, conventional object recognition and instance segmentation methods face a formidable challenge when it comes to perceiving Deformable Linear Objects (DLOs) like wires, cables, and flexible tubes. This challenge arises primarily from the lack of distinct attributes such as shape, color, and texture, which calls for tailored solutions to achieve precise identification. In this work, we propose a foundation model-based DLO instance segmentation technique that is text-promptable and user-friendly. Specifically, our approach combines the text-conditioned semantic segmentation capabilities of the CLIPSeg model with the zero-shot generalization capabilities of the Segment Anything Model (SAM). We show that our method exceeds SOTA performance on DLO instance segmentation, achieving a mIoU of 91.21%. We also introduce a rich and diverse DLO-specific dataset for…
Peer Reviews
Decision · Submitted to ICLR 2024
The proposed approach is relatively straightforward, as it is a direct application of the CLIPSeg and SAM models to a highly specific domain. The main idea of adding adapters is technically sound and also aligns well with the problem setting, since well-labeled cable images are by nature not easy to acquire. This validates the choice of using adapters in this approach instead of full fine-tuning. In this sense, the proposed approach is reasonably motivated. In addition, adding text prompt to the model a
This work is good overall and has no significant flaws, but I do want to mention that it is more an application of existing models to a new domain, with some modifications, than a novel approach. The way these models are used is relatively straightforward. However, there are still a few questions to be answered. Please see the detailed comments below.
(1) The proposed method combines SAM with text conditions and constructs a prompt encoder to improve overall DLO segmentation ability. (2) The proposed method achieves state-of-the-art performance compared with other recent algorithms on DLO instance segmentation.
(1) It seems that the proposed method relies on the assumption that, if properly prompted, SAM can always provide correct cable segmentation masks. As the authors claim, the performance upper bound is limited by SAM and CLIPSeg. I wonder what the exact upper bound of these two methods is, and how closely the proposed method approaches it. (2) Run-time for each method is not evaluated or analyzed in Tables 1 and 2. (3) Typo in Section 2.2: ” ... that have historically have been difficult to se
- Proposed a prompt encoder network to obtain point prompts from text prompts by CLIPSeg. - Proposed a binary classifier network for the quality of SAM-generated masks. - Achieved a solid result.
- Utilizing the combination of two powerful models, CLIPSeg and SAM, may be effective but is not novel. - The design motivation in Section 3.1.2 (i.e., MLP, cross-attention, self-attention) is missing. - Few baselines. The only other method mentioned is RT-DLO. Considering that the authors are leveraging strong semantic segmentation methods, including SAM, they should compare their method with those segmentation methods. - No ablation studies were conducted.
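The pipeline the reviewers describe (a text-conditioned heatmap from CLIPSeg, turned into point prompts for SAM, with a classifier filtering the resulting masks) can be pictured with a simplified stand-in. The sketch below is not the authors' method: the paper uses a learned prompt encoder and a binary classifier network, whereas this example swaps in greedy peak picking and a fixed score threshold. It makes no CLIPSeg or SAM calls; the heatmap, masks, and scores are hypothetical inputs, and `top_k_points` and `keep_confident` are illustrative names.

```python
def top_k_points(heatmap, k, min_dist):
    """Greedy peak picking: take the highest-scoring pixels first and skip
    any pixel closer than min_dist to an already chosen point.
    Stand-in for a learned prompt encoder (assumption, not the paper's design)."""
    cells = sorted(
        ((score, r, c)
         for r, row in enumerate(heatmap)
         for c, score in enumerate(row)),
        reverse=True,
    )
    points = []
    for _score, r, c in cells:
        if len(points) >= k:
            break
        far_enough = all(
            (r - pr) ** 2 + (c - pc) ** 2 >= min_dist ** 2
            for pr, pc in points
        )
        if far_enough:
            points.append((r, c))
    return points


def keep_confident(masks, scores, threshold=0.5):
    """Keep masks whose quality score passes a fixed threshold.
    Stand-in for a learned mask-quality classifier (assumption)."""
    return [m for m, s in zip(masks, scores) if s >= threshold]


if __name__ == "__main__":
    heatmap = [
        [0.1, 0.9, 0.8],
        [0.0, 0.2, 0.1],
        [0.7, 0.1, 0.0],
    ]
    print(top_k_points(heatmap, k=2, min_dist=2))
    print(keep_confident(["mask_a", "mask_b"], [0.9, 0.3]))
```

A fixed heuristic like this has no way to separate touching or crossing cables, which is one plausible reason to learn the prompt encoder instead; that reading is an interpretation, not a claim from the reviews.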
Taxonomy
Topics: Infrastructure Maintenance and Monitoring · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
