Visuomotor Grasping with World Models for Surgical Robots
Hongbin Lin, Bin Li, and Kwok Wai Samuel Au

TL;DR
This paper presents GASv2, a visuomotor learning framework for surgical grasping. It uses a world-model architecture trained in simulation, generalizes to unseen objects and environments, and has been deployed in real surgical settings.
Contribution
Introduces GASv2, a visuomotor policy for surgical grasping that relies on a single stereo camera setup and achieves sim-to-real transfer, object-agnostic generalization, and robustness.
Findings
65% success rate in real surgical environments
Generalizes to unseen objects and tools
Robust to visual disturbances and environment variations
Abstract
Grasping is a fundamental task in robot-assisted surgery (RAS), and automating it can reduce surgeon workload while enhancing efficiency, safety, and consistency beyond teleoperated systems. Most prior approaches rely on explicit object pose tracking or handcrafted visual features, limiting their generalization to novel objects, robustness to visual disturbances, and ability to handle deformable objects. Visuomotor learning offers a promising alternative, but deploying it in RAS presents unique challenges, such as a low signal-to-noise ratio in visual observations, demands for high safety and millimeter-level precision, and the complexity of the surgical environment. This paper addresses three key challenges: (i) sim-to-real transfer of visuomotor policies to ex vivo surgical scenes, (ii) visuomotor learning using only a single stereo camera pair -- the standard RAS setup, and (iii)…
