Learning Surgical Robotic Manipulation with 3D Spatial Priors

Yu Sheng; Lidian Wang; Xiaomeng Chu; Jiajun Deng; Min Cheng; Yanyong Zhang; Bei Hua; Houqiang Li; Jianmin Ji

arXiv:2603.03798·cs.RO·March 5, 2026

Open Access

TL;DR

This paper introduces the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy for surgical robots that draws 3D spatial cues directly from endoscopic images. Supported by the large-scale Surgical3D dataset, it is reported to achieve state-of-the-art results on complex surgical tasks.

Contribution

The work presents a novel end-to-end visuomotor policy leveraging 3D spatial cues directly from endoscopic images, along with a large-scale dataset and a geometric transformer for surgical robotics.

Findings

01. SST outperforms existing methods on complex surgical tasks.

02. The Surgical3D dataset enables robust 3D feature learning.

03. SST shows strong spatial generalization in experiments.

Abstract

Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required. Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscopes. However, both paradigms suffer from notable limitations: the former easily leads to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice since wrist-mounted cameras can interfere with the motion of surgical robot arms. In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that empowers surgical robots with 3D spatial awareness by directly exploring 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a…

Taxonomy

Topics: Soft Robotics and Applications · Surgical Simulation and Training · Advanced Vision and Imaging