Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision
Soham Walimbe, Britty Baby, Vinkle Srivastav, Nicolas Padoy

TL;DR
This paper introduces MML-SurgAdapt, a multi-task surgical computer vision framework using vision-language models and single positive multi-label learning to handle diverse tasks with incomplete annotations, reducing labeling effort and improving scalability.
Contribution
It presents the first application of SPML to multi-task surgical data, integrating multiple tasks with noisy labels using a unified VLM-based model.
Findings
Achieves comparable performance to task-specific models
Reduces annotation effort by 23%
Outperforms existing SPML frameworks in surgical tasks
Abstract
Surgical AI often involves multiple tasks within a single procedure, like phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility, requiring a separate model for each. To address this, we introduce MML-SurgAdapt, a unified multi-task framework with Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective…
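To make the single-positive setting concrete, here is a minimal sketch of how a fully labeled example could be reduced to one observed positive and scored with the common "assume negative" baseline, where every unobserved label is treated as a negative. This is an illustration only, not the loss used by MML-SurgAdapt; the function names `to_single_positive` and `assume_negative_bce` are hypothetical and do not come from the paper.

```python
import math


def to_single_positive(full_labels, keep_index):
    """Keep one observed positive; mark every other label as unobserved (None)."""
    if full_labels[keep_index] != 1:
        raise ValueError("keep_index must point at a positive label")
    return [1 if i == keep_index else None for i in range(len(full_labels))]


def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def assume_negative_bce(logits, observed):
    """Mean binary cross-entropy with unobserved labels treated as negatives.

    This is the simple 'assume negative' baseline (an assumption here),
    not the specific SPML objective of the paper.
    """
    total = 0.0
    for z, y in zip(logits, observed):
        # -log(sigmoid(z)) for a positive target, -log(1 - sigmoid(z)) otherwise.
        total += _softplus(-z) if y == 1 else _softplus(z)
    return total / len(logits)


# Hypothetical frame with two true positives; only one is kept as observed.
observed = to_single_positive([1, 0, 1, 0], keep_index=2)
loss = assume_negative_bce([0.0, 0.0, 0.0, 0.0], observed)
```

In this toy case the dropped positive at index 0 is penalized as if it were a negative, which is exactly the false-negative noise that SPML methods are designed to tolerate.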
Taxonomy
Topics: Medical Imaging and Analysis · Surgical Simulation and Training · Advanced X-ray and CT Imaging
