Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics

Omar Mohamed; Edoardo Fazzari; Ayah Al-Naji; Hamdan Alhadhrami; Khalfan Hableel; Saif Alkindi; Ivan Laptev; Cesare Stefanini

arXiv:2602.24138·cs.CV·May 21, 2026

TL;DR

TASOT is a novel annotation-free framework that combines visual and textual cues via optimal transport for surgical temporal segmentation, eliminating the need for extensive labeled data or domain-specific pretraining.

Contribution

The paper introduces TASOT, which fuses visual and semantic information through optimal transport to segment surgical workflows without task-specific annotations or surgical-domain pretraining.

Findings

01 Achieves a +18.9 F1 improvement on the Cholec80 dataset.

02 Outperforms zero-shot baselines on multiple surgical datasets.

03 Demonstrates effective surgical phase recognition without manual annotations.

Abstract

Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified…
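
To make the idea of fusing visual and textual cues inside an optimal transport problem concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the TASOT or ASOT formulation: the abstract above is truncated, so the sketch substitutes plain balanced entropic optimal transport solved with Sinkhorn iterations and has no temporal-consistency term. The function names, the per-modality prototypes, the mixing weight `alpha`, and the regularization strength `eps` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_cost(feats, protos):
    """Cosine distance between every frame feature and every prototype."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return 1.0 - f @ p.T


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Balanced entropic OT with uniform marginals (a stand-in solver)."""
    n, k = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass over frames
    b = np.full(k, 1.0 / k)   # uniform mass over phases (an assumption)
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    # Transport plan: rows are frames, columns are phases.
    return u[:, None] * kernel * v[None, :]


def segment(vis_feats, txt_feats, vis_protos, txt_protos, alpha=0.5, eps=0.1):
    """Fuse visual and textual costs, solve OT, and label each frame.

    alpha is a hypothetical mixing weight between the two modalities.
    """
    cost = (alpha * cosine_cost(vis_feats, vis_protos)
            + (1.0 - alpha) * cosine_cost(txt_feats, txt_protos))
    plan = sinkhorn(cost, eps=eps)
    return plan.argmax(axis=1), plan
```

Each frame is assigned to the phase that receives most of its transported mass. Forcing uniform mass over phases is a simplification that assumes phases of roughly equal duration, which rarely holds for surgical video.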

Taxonomy

Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Human Pose and Action Recognition