Visuomotor Grasping with World Models for Surgical Robots

Hongbin Lin, Bin Li, and Kwok Wai Samuel Au

arXiv:2508.11200 · cs.RO · August 18, 2025

TL;DR

This paper presents GASv2, a visuomotor learning framework for surgical grasping that generalizes to unseen objects and environments, using a world-model architecture trained in simulation and successfully deployed in real surgical settings.

Contribution

Introduces GASv2, a novel visuomotor policy for surgical grasping that achieves sim-to-real transfer, object-agnostic generalization, and robustness using a single stereo camera setup.

Findings

1. 65% success rate in real surgical environments
2. Generalizes to unseen objects and tools
3. Robust to visual disturbances and environment variations

Abstract

Grasping is a fundamental task in robot-assisted surgery (RAS), and automating it can reduce surgeon workload while enhancing efficiency, safety, and consistency beyond teleoperated systems. Most prior approaches rely on explicit object pose tracking or handcrafted visual features, limiting their generalization to novel objects, robustness to visual disturbances, and the ability to handle deformable objects. Visuomotor learning offers a promising alternative, but deploying it in RAS presents unique challenges, such as low signal-to-noise ratio in visual observations, demands for high safety and millimeter-level precision, as well as the complex surgical environment. This paper addresses three key challenges: (i) sim-to-real transfer of visuomotor policies to ex vivo surgical scenes, (ii) visuomotor learning using only a single stereo camera pair -- the standard RAS setup, and (iii)…
