SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction

Shengkai Wu, Jinrong Yang, Wenqiu Luo, Linfeng Gao, Chaohui Shang, Meiyu Zhi, Mingshan Sun, Fangping Yang, Liangliang Ren, and Yong Zhao

arXiv:2512.02609 · cs.RO · December 3, 2025

Open Access

TL;DR

SAM2Grasp introduces a prompt-conditioned, temporal action prediction framework that resolves multimodal grasping ambiguity, enabling continuous, unambiguous robotic grasping in cluttered scenes by leveraging a frozen visual model and a lightweight trainable head.

Contribution

The paper reformulates multi-object grasping as a uni-modal, prompt-conditioned prediction task. It keeps the SAM2 visual model frozen and trains only a lightweight action head that runs in parallel with SAM2's native segmentation head.

Findings

01. Achieves state-of-the-art results in multi-object grasping tasks.

02. Effectively eliminates ambiguity in visuomotor policies.

03. Maintains stable object tracking during grasping in cluttered scenes.

Abstract

Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model to use its powerful visual temporal tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object…
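
The averaging failure described in the abstract can be illustrated with a toy example. The sketch below is not the paper's method or code: it assumes a hypothetical 1-D scene with two objects, a made-up prompt label per demonstration, and a policy that simply predicts the mean-squared-error-optimal action. Without a prompt, that optimum is the mean of the conflicting demonstrations; with a prompt, each target gets its own valid action.

```python
from statistics import mean

# Toy 1-D scene (an assumption for illustration): two graspable objects
# at x = -1.0 and x = +1.0. Each demonstration records a hypothetical
# prompt label for the grasped object and the gripper target position.
demos = [
    ("left", -1.0), ("right", 1.0),
    ("left", -1.0), ("right", 1.0),
]

def fit_unconditioned(demos):
    """MSE-optimal constant action when the policy sees no prompt."""
    return mean(action for _, action in demos)

def fit_prompt_conditioned(demos):
    """MSE-optimal action per prompt: one mean per target object."""
    by_prompt = {}
    for prompt, action in demos:
        by_prompt.setdefault(prompt, []).append(action)
    return {prompt: mean(actions) for prompt, actions in by_prompt.items()}

averaged = fit_unconditioned(demos)
conditioned = fit_prompt_conditioned(demos)

print(averaged)     # a point between the two objects, where nothing can be grasped
print(conditioned)  # one action per prompt, each on an actual object
```

In this toy setting the prompt plays the role that the abstract assigns to the initial bounding box: it selects a single target, so the regression problem becomes uni-modal.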

Taxonomy

Topics: Robot Manipulation and Learning · Motor Control and Adaptation · Human Pose and Action Recognition