CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation

Chen Jiang; Yuchen Yang; Martin Jagersand

arXiv:2309.09183·cs.RO·September 19, 2023

TL;DR

This paper introduces CLIPUNetr, a CLIP-driven segmentation network that enhances human-robot interfaces by using natural language expressions for more effective visual servoing in unstructured environments.

Contribution

The paper presents CLIPUNetr, a new model for referring expression segmentation, and integrates it into uncalibrated visual servoing, enabling more natural and semantically rich robot control.

Findings

01. 120% improvement in boundary and structure measurements
02. Successful real-world robot control in unstructured environments
03. Enhanced segmentation quality with sharper boundaries

Abstract

The classical human-robot interface in uncalibrated image-based visual servoing (UIBVS) relies on either human annotations or semantic segmentation with categorical labels. Both methods fail to match natural human communication and convey rich semantics in manipulation tasks as effectively as natural language expressions. In this paper, we tackle this problem by using referring expression segmentation, which is a prompt-based approach, to provide more in-depth information for robot perception. To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr, a new CLIP-driven referring expression segmentation network. CLIPUNetr leverages CLIP's strong vision-language representations to segment regions from referring expressions, while utilizing its "U-shaped" encoder-decoder architecture to generate predictions with sharper boundaries and finer…

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
