VidLPRO: A $\underline{Vid}$eo-$\underline{L}$anguage $\underline{P}$re-training Framework for $\underline{Ro}$botic and Laparoscopic Surgery

Mohammadmahdi Honarmand; Muhammad Abdullah Jamal; Omid Mohareri

arXiv:2409.04732 · cs.CV · September 13, 2024

TL;DR

VidLPRO is a new video-language pre-training framework tailored for robotic and laparoscopic surgery, integrating multiple learning objectives and a large-scale dataset to improve surgical video understanding and achieve state-of-the-art results.

Contribution

The paper introduces VidLPRO, a comprehensive VL pre-training approach with a novel dataset, advancing surgical video analysis beyond contrastive learning methods.

Findings

01. Achieves up to 21.5% accuracy improvement in zero-shot surgical phase recognition

02. Sets new benchmarks on Cholec80 and AutoLaparo datasets

03. Demonstrates robustness with single-frame inference and scalable temporal context
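As a rough illustration of what zero-shot phase recognition means in a video-language setting, the sketch below assigns a clip to the phase whose text embedding is most similar to the clip's video embedding. This is a minimal sketch under stated assumptions, not the paper's inference procedure: the function names `cosine` and `zero_shot_phase` are hypothetical, the three phase labels are illustrative, and the 3-dimensional vectors stand in for embeddings that a pre-trained video encoder and text encoder would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_shot_phase(video_emb, phase_text_embs):
    """Return the phase name whose text embedding is closest to the video embedding.

    `phase_text_embs` maps a phase name to the embedding of its text prompt.
    No phase labels are used for training here; the class set is defined
    purely by the text prompts, which is what makes the setup zero-shot.
    """
    return max(phase_text_embs, key=lambda name: cosine(video_emb, phase_text_embs[name]))


# Toy embeddings (assumed values, for illustration only).
phase_text_embs = {
    "preparation": [1.0, 0.0, 0.0],
    "clipping and cutting": [0.0, 1.0, 0.0],
    "gallbladder dissection": [0.0, 0.0, 1.0],
}
predicted = zero_shot_phase([0.1, 0.9, 0.0], phase_text_embs)
```

Single-frame inference, as mentioned in the findings, would correspond to computing `video_emb` from one frame rather than a multi-frame clip; the matching step itself would stay the same in this sketch.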

Abstract

We introduce VidLPRO, a novel video-language (VL) pre-training framework designed specifically for robotic and laparoscopic surgery. While existing surgical VL models primarily rely on contrastive learning, we propose a more comprehensive approach to capture the intricate temporal dynamics and align video with language. VidLPRO integrates video-text contrastive learning, video-text matching, and masked language modeling objectives to learn rich VL representations. To support this framework, we present GenSurg+, a carefully curated dataset derived from GenSurgery, comprising 17k surgical video clips paired with captions generated by GPT-4 using transcripts extracted by the Whisper model. This dataset addresses the need for large-scale, high-quality VL data in the surgical domain. Extensive experiments on benchmark datasets, including Cholec80 and AutoLaparo, demonstrate the efficacy of…
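The abstract names three pre-training objectives: video-text contrastive learning, video-text matching, and masked language modeling. The sketch below shows one common way such objectives are combined, using a symmetric InfoNCE-style contrastive loss and a weighted sum. It is an illustration under assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the equal loss weights are all hypothetical, and the matching and masked-language-modeling losses are passed in as precomputed numbers.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def vtc_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired (video, text) embeddings.

    Pair i is the positive for row i; every other item in the batch is a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    v = [_normalize(e) for e in video_embs]
    t = [_normalize(e) for e in text_embs]
    n = len(v)
    # sims[i][j]: scaled cosine similarity between video i and text j.
    sims = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
            for i in range(n)]
    video_to_text = -sum(_log_softmax(sims[i])[i] for i in range(n)) / n
    sims_t = [list(col) for col in zip(*sims)]  # transpose for the text-to-video direction
    text_to_video = -sum(_log_softmax(sims_t[j])[j] for j in range(n)) / n
    return 0.5 * (video_to_text + text_to_video)


def total_loss(l_vtc, l_vtm, l_mlm, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives; equal weights are an assumption."""
    return weights[0] * l_vtc + weights[1] * l_vtm + weights[2] * l_mlm


# Toy batch: correctly paired embeddings should score a much lower loss
# than the same embeddings with the captions swapped.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned = vtc_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped = vtc_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch `aligned` is close to zero while `swapped` is large, which is the behavior a contrastive objective rewards; the matching and masked-language-modeling terms would add supervision that a purely contrastive setup does not provide.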

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education

Methods: Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Attention Is All You Need · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer