Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics
Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Ivan Laptev, Cesare Stefanini

TL;DR
TASOT is a novel annotation-free framework that combines visual and textual cues via optimal transport for surgical temporal segmentation, eliminating the need for extensive labeled data or domain-specific pretraining.
Contribution
The paper introduces TASOT, which fuses visual and semantic information through optimal transport to segment surgical workflows accurately without annotations or surgical-domain pretraining.
Findings
Improves F1 by 18.9 points on the Cholec80 dataset.
Outperforms zero-shot baselines on multiple surgical datasets.
Demonstrates effective surgical phase recognition without manual annotations.
Abstract
Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified…
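To make the idea of fusing visual and textual cues inside one transport problem more concrete, here is a minimal sketch in Python. It is not the authors' implementation and not the ASOT formulation itself: it assumes plain balanced entropic optimal transport solved with Sinkhorn iterations, cosine-distance costs, uniform marginals, and a simple convex combination of the two cost matrices. All names (cosine_cost, sinkhorn, fuse_and_segment, alpha, eps, the prototype arrays) are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_cost(features, prototypes):
    """Cost matrix of shape (T, K): 1 - cosine similarity between frames and prototypes."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return 1.0 - f @ p.T

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Balanced entropic OT with uniform marginals (an assumption of this sketch)."""
    n_frames, n_classes = cost.shape
    a = np.full(n_frames, 1.0 / n_frames)    # mass per frame
    b = np.full(n_classes, 1.0 / n_classes)  # mass per segment class
    kernel = np.exp(-cost / eps)
    u = np.ones(n_frames)
    v = np.ones(n_classes)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    # Transport plan: entry (t, k) is the mass sent from frame t to class k.
    return u[:, None] * kernel * v[None, :]

def fuse_and_segment(visual, text, vis_protos, txt_protos, alpha=0.5):
    """Blend visual and textual costs, solve OT, and label each frame by its argmax."""
    cost = alpha * cosine_cost(visual, vis_protos) \
        + (1.0 - alpha) * cosine_cost(text, txt_protos)
    plan = sinkhorn(cost)
    return plan.argmax(axis=1), plan

# Synthetic stand-ins for per-frame visual features and per-frame text embeddings.
rng = np.random.default_rng(0)
visual = rng.normal(size=(12, 8))      # 12 frames, 8-dim visual features
text = rng.normal(size=(12, 6))        # 12 frames, 6-dim text features
vis_protos = rng.normal(size=(3, 8))   # 3 hypothetical segment prototypes
txt_protos = rng.normal(size=(3, 6))
labels, plan = fuse_and_segment(visual, text, vis_protos, txt_protos)
```

In this sketch, alpha controls how much weight the visual cost receives relative to the textual cost. The sketch omits any temporal-consistency term, so the resulting labels are assigned frame by frame without smoothing.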
Taxonomy
Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Human Pose and Action Recognition
