Learning Surgical Robotic Manipulation with 3D Spatial Priors
Yu Sheng, Lidian Wang, Xiaomeng Chu, Jiajun Deng, Min Cheng, Yanyong Zhang, Bei Hua, Houqiang Li, Jianmin Ji

TL;DR
This paper introduces SST, an end-to-end surgical robot manipulation method using 3D spatial cues from endoscopic images, supported by a large-scale dataset, achieving state-of-the-art results in complex surgical tasks.
Contribution
The work presents a novel end-to-end visuomotor policy leveraging 3D spatial cues directly from endoscopic images, along with a large-scale dataset and a geometric transformer for surgical robotics.
Findings
SST outperforms existing methods on complex surgical tasks.
The Surgical3D dataset enables robust 3D feature learning.
SST demonstrates strong spatial generalization in experiments.
Abstract
Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required. Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscopes. However, both paradigms suffer from notable limitations: the former easily leads to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice since wrist-mounted cameras can interfere with the motion of surgical robot arms. In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that empowers surgical robots with 3D spatial awareness by directly exploring 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a…
Taxonomy
Topics: Soft Robotics and Applications · Surgical Simulation and Training · Advanced Vision and Imaging
