Attention-Guided Integration of CLIP and SAM for Precise Object Masking in Robotic Manipulation

Muhammad A. Muttaqien; Tomohiro Motoda; Ryo Hanai; Domae Yukiyasu

arXiv:2502.18842 · cs.RO · March 3, 2025

Open Access

TL;DR

This paper presents a new pipeline combining CLIP and SAM models with attention mechanisms to improve object masking accuracy for robotic manipulation in convenience store environments.

Contribution

It introduces a novel integration of CLIP and SAM with gradient-based attention for enhanced object segmentation in robotic tasks.

Findings

01. Improved mask precision for robotic manipulation.
02. Effective use of multimodal data for segmentation.
03. Enhanced adaptability in convenience store scenarios.

Abstract

This paper introduces a novel pipeline to enhance the precision of object masking for robotic manipulation within the specific domain of masking products in convenience stores. The approach integrates two advanced AI models, CLIP and SAM, focusing on their synergistic combination and the effective use of multimodal data (image and text). Emphasis is placed on utilizing gradient-based attention mechanisms and customized datasets to fine-tune performance. While CLIP, SAM, and Grad-CAM are established components, their integration within this structured pipeline represents a significant contribution to the field. The resulting segmented masks, generated through this combined approach, can be effectively utilized as inputs for robotic systems, enabling more precise and adaptive object manipulation in the context of convenience store products.
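The abstract does not spell out how the gradient-based attention is handed to SAM, so the following is only a minimal sketch under stated assumptions. It computes a standard Grad-CAM map (channel weights from spatially averaged gradients, a weighted sum of activations, then ReLU) and converts the map into a point and a box of the kind SAM accepts as prompts. The functions `grad_cam` and `heatmap_to_prompts`, the 0.5 threshold, and the toy `ACTS`/`GRADS` values are all hypothetical; in a real pipeline the activations and gradients would come from a CLIP image encoder scored against a text query, which is not shown here.

```python
def grad_cam(activations, gradients):
    """Grad-CAM over nested lists shaped [channels][height][width].

    Returns a heatmap scaled to [0, 1] (all zeros if nothing is positive).
    """
    k = len(activations)
    h = len(activations[0])
    w = len(activations[0][0])
    # One weight per channel: the spatial mean of that channel's gradient.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum over channels, followed by ReLU.
    cam = [
        [max(0.0, sum(weights[c] * activations[c][y][x] for c in range(k)))
         for x in range(w)]
        for y in range(h)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


def heatmap_to_prompts(cam, threshold=0.5):
    """Turn a heatmap into a (x, y) point and an (x0, y0, x1, y1) box.

    Assumption: cells at or above `threshold` mark the target object.
    Returns (None, None) when no cell passes the threshold.
    """
    h, w = len(cam), len(cam[0])
    hot = [(x, y) for y in range(h) for x in range(w) if cam[y][x] >= threshold]
    if not hot:
        return None, None
    point = max(hot, key=lambda p: cam[p[1]][p[0]])
    xs = [p[0] for p in hot]
    ys = [p[1] for p in hot]
    return point, (min(xs), min(ys), max(xs), max(ys))


# Toy stand-ins for encoder activations and gradients (2 channels, 3x3).
ACTS = [
    [[0.0, 0.0, 0.0], [0.0, 4.0, 2.0], [0.0, 2.0, 1.0]],
    [[3.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
GRADS = [
    [[1.0] * 3 for _ in range(3)],   # channel 0 supports the text query
    [[-1.0] * 3 for _ in range(3)],  # channel 1 opposes it
]

cam = grad_cam(ACTS, GRADS)
point, box = heatmap_to_prompts(cam)
print("point prompt:", point, "box prompt:", box)
```

The point and box are in heatmap cell coordinates, so they would need to be rescaled to image pixels before being passed to a SAM predictor.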

Taxonomy

Topics: Robot Manipulation and Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Class-activation map · Segment Anything Model