Real-time Capable Learning-based Visual Tool Pose Correction via Differentiable Simulation

Shuyuan Yang; Zonghe Chua

arXiv:2505.08875·cs.RO·March 18, 2026

TL;DR

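This paper introduces a real-time capable, Vision Transformer-based method for correcting noisy tool pose estimates in robot-assisted surgery. The model is trained with end-to-end differentiable kinematics and rendering, targeting the speed, generalizability, and training difficulties that limit existing vision-based approaches.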

Contribution

It presents a training approach built on end-to-end differentiable kinematics and rendering that enables real-time capable, generalizable tool pose correction in robot-assisted surgery using a Vision Transformer.

Findings

01. Reduces translation error by more than 50% on the datasets evaluated

02. Achieves near real-time inference at 22 Hz

03. Generalizes well to unseen datasets

Abstract

Autonomy in robot-assisted minimally invasive surgery has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, implementing accurate autonomous control can be difficult due to poor end-effector proprioception. Joint encoder readings are typically inaccurate due to kinematic non-idealities in their cable-driven transmissions. Vision-based pose estimation approaches are highly effective, but lack real-time capability, generalizability, or can be hard to train. In this work, we demonstrate a real-time capable, Vision Transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering. We demonstrate the potential of this approach to correct for noisy pose estimates through a real robot dataset and the potential real-time processing ability. Our approach is able to reduce more than 50%…
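
The idea of correcting a pose by differentiating through kinematics can be illustrated with a toy problem. The Python sketch below is not the authors' method: it replaces the Vision Transformer and differentiable renderer with a hypothetical planar two-link arm and a single observed tip position, and it uses plain gradient descent through an analytic Jacobian to pull noisy joint readings toward values consistent with the observation. The link lengths, step size, iteration count, and all function names are assumptions chosen for illustration.

```python
import math

# Hypothetical link lengths for a planar two-link arm (illustration only).
LINK_LENGTHS = (1.0, 0.8)


def forward_kinematics(q):
    """Return the tip position (x, y) for joint angles q = (q1, q2)."""
    l1, l2 = LINK_LENGTHS
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q):
    """Return the 2x2 Jacobian d(x, y) / d(q1, q2) of forward_kinematics."""
    l1, l2 = LINK_LENGTHS
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return (
        (-l1 * s1 - l2 * s12, -l2 * s12),
        (l1 * c1 + l2 * c12, l2 * c12),
    )


def correct_joints(q_noisy, tip_observed, lr=0.2, steps=1000):
    """Gradient descent on 0.5 * ||fk(q) - tip_observed||^2, starting at q_noisy."""
    q = list(q_noisy)
    for _ in range(steps):
        x, y = forward_kinematics(q)
        ex, ey = x - tip_observed[0], y - tip_observed[1]
        J = jacobian(q)
        # Chain rule: the gradient of the loss with respect to q is J^T e.
        g0 = J[0][0] * ex + J[1][0] * ey
        g1 = J[0][1] * ex + J[1][1] * ey
        q[0] -= lr * g0
        q[1] -= lr * g1
    return tuple(q)


if __name__ == "__main__":
    q_true = (0.6, 0.9)                  # angles that generated the observation
    observed = forward_kinematics(q_true)
    q_noisy = (0.7, 0.8)                 # simulated inaccurate encoder reading
    print(correct_joints(q_noisy, observed))
```

In this toy setting the corrected angles should return close to the values that generated the observation, provided the initial error is small enough to stay in the same elbow configuration. A learned approach such as the one the abstract describes would instead train a network to predict the correction in a single forward pass, which is what makes real-time operation plausible.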
