Real-time Capable Learning-based Visual Tool Pose Correction via Differentiable Simulation
Shuyuan Yang, Zonghe Chua

TL;DR
This paper introduces a real-time capable, Vision Transformer-based method for tool pose estimation and correction in robot-assisted surgery, aiming to improve on the accuracy and speed of traditional optimization approaches.
Contribution
The paper presents a training approach based on end-to-end differentiable kinematics and rendering, which enables real-time capable, generalizable pose correction in robot-assisted surgery using a Vision Transformer.
Findings
Reduces translation error by more than 50% on the evaluated datasets
Achieves near real-time inference at 22 Hz
Generalizes well to unseen datasets
Abstract
Autonomy in robot-assisted minimally invasive surgery has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, implementing accurate autonomous control can be difficult due to poor end-effector proprioception. Joint encoder readings are typically inaccurate due to kinematic non-idealities in the cable-driven transmissions. Vision-based pose estimation approaches are highly effective, but lack real-time capability or generalizability, or can be hard to train. In this work, we demonstrate a real-time capable, Vision Transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering. Using a real robot dataset, we demonstrate the potential of this approach to correct noisy pose estimates, as well as its potential for real-time processing. Our approach is able to reduce more than 50%…
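The idea of correcting noisy joint readings by differentiating through a kinematic model can be illustrated with a toy example. The sketch below is not the authors' method: it has no Vision Transformer and no renderer. It assumes a hypothetical planar two-link arm with unit link lengths, uses a hand-derived Jacobian in place of automatic differentiation, and treats an observed tip position as a stand-in for image evidence. All function and variable names are illustrative.

```python
import math

# Hypothetical planar two-link arm; link lengths are assumptions for illustration.
LENGTHS = (1.0, 1.0)

def forward_kinematics(q):
    """Tip position (x, y) for joint angles q = (q1, q2)."""
    l1, l2 = LENGTHS
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return x, y

def jacobian(q):
    """Analytic 2x2 Jacobian of the tip position with respect to q."""
    l1, l2 = LENGTHS
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return ((-l1 * s1 - l2 * s12, -l2 * s12),
            (l1 * c1 + l2 * c12, l2 * c12))

def tip_error(q, tip_observed):
    """Euclidean distance between the predicted and observed tip positions."""
    x, y = forward_kinematics(q)
    return math.hypot(x - tip_observed[0], y - tip_observed[1])

def correct_joints(q_noisy, tip_observed, step=0.1, iters=1000):
    """Gradient descent on 0.5 * ||fk(q) - tip_observed||^2, starting at q_noisy."""
    q = list(q_noisy)
    for _ in range(iters):
        x, y = forward_kinematics(q)
        ex, ey = x - tip_observed[0], y - tip_observed[1]
        J = jacobian(q)
        # The gradient of the squared-error loss with respect to q is J^T e.
        g0 = J[0][0] * ex + J[1][0] * ey
        g1 = J[0][1] * ex + J[1][1] * ey
        q[0] -= step * g0
        q[1] -= step * g1
    return tuple(q)

# Synthetic example: encoder readings are offset from the true joint angles.
q_true = (0.6, 0.9)
q_noisy = (0.7, 0.8)
tip_observed = forward_kinematics(q_true)

q_corrected = correct_joints(q_noisy, tip_observed)
error_before = tip_error(q_noisy, tip_observed)
error_after = tip_error(q_corrected, tip_observed)
print(f"tip error before: {error_before:.4f}, after: {error_after:.2e}")
```

In the setting the abstract describes, the kinematics and rendering are differentiable end to end and are used to train a network, so the correction at inference time would come from a forward pass rather than from an iterative loop like the one above.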
