Attention-Guided Integration of CLIP and SAM for Precise Object Masking in Robotic Manipulation
Muhammad A. Muttaqien, Tomohiro Motoda, Ryo Hanai, Domae Yukiyasu

TL;DR
This paper presents a new pipeline combining CLIP and SAM models with attention mechanisms to improve object masking accuracy for robotic manipulation in convenience store environments.
Contribution
It introduces a novel integration of CLIP and SAM with gradient-based attention for enhanced object segmentation in robotic tasks.
Findings
Improved mask precision for robotic manipulation.
Effective use of multimodal data for segmentation.
Enhanced adaptability in convenience store scenarios.
Abstract
This paper introduces a novel pipeline to enhance the precision of object masking for robotic manipulation within the specific domain of masking products in convenience stores. The approach integrates two advanced AI models, CLIP and SAM, focusing on their synergistic combination and the effective use of multimodal data (image and text). Emphasis is placed on utilizing gradient-based attention mechanisms and customized datasets to fine-tune performance. While CLIP, SAM, and Grad-CAM are established components, their integration within this structured pipeline represents a significant contribution to the field. The resulting segmented masks, generated through this combined approach, can be effectively utilized as inputs for robotic systems, enabling more precise and adaptive object manipulation in the context of convenience store products.
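The abstract does not say how the gradient-based attention map is handed to SAM. One plausible way to connect the two stages, shown here only as a minimal sketch and not as the authors' method, is to convert a Grad-CAM-style heatmap into a point prompt (the peak activation) and a box prompt (the bounding box of the above-threshold region). The function name `heatmap_to_prompts`, the `threshold` parameter, and the toy heatmap are all hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]          # rows of activation values, indexed [y][x]
Point = Tuple[int, int]              # (x, y) pixel coordinate
Box = Tuple[int, int, int, int]      # (x_min, y_min, x_max, y_max)


def heatmap_to_prompts(heatmap: Heatmap, threshold: float = 0.5) -> Tuple[Point, Box]:
    """Turn an attention heatmap into a point prompt and a box prompt.

    Assumption (not from the paper): the heatmap is min-max normalised,
    the peak becomes a foreground point, and the bounding box of all
    pixels at or above `threshold` becomes a box prompt.
    """
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        raise ValueError("heatmap is constant; no region to localise")

    peak: Point = (0, 0)
    peak_val = -1.0
    xs: List[int] = []
    ys: List[int] = []
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            norm = (v - lo) / span           # scale to [0, 1]
            if norm > peak_val:              # track the strongest activation
                peak_val, peak = norm, (x, y)
            if norm >= threshold:            # collect the salient region
                xs.append(x)
                ys.append(y)

    box: Box = (min(xs), min(ys), max(xs), max(ys))
    return peak, box


# Toy 4x4 heatmap standing in for a Grad-CAM map computed for a text query.
toy = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.6, 0.0],
    [0.0, 0.8, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
point, box = heatmap_to_prompts(toy, threshold=0.5)
print(point, box)
```

In a real pipeline the heatmap would first be resized to the input image resolution, and the resulting point and box could then be passed to a promptable segmenter such as SAM. Whether the paper uses points, boxes, or another prompt type is not stated in the abstract.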
Taxonomy
Topics: Robot Manipulation and Learning · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Class-activation map · Segment Anything Model
