TL;DR
This paper introduces SurgLIME, a novel framework for surgical vision-language pre-training that effectively utilizes noisy LLM-generated narratives to enhance multi-modal understanding without degrading visual priors.
Contribution
It presents LIME, a scalable surgical video dataset with LLM-generated annotations, and SurgLIME, a parameter-efficient VLP method that mitigates noise through confidence estimation and preserves medical priors.
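The summary does not say how SurgLIME computes or applies its confidence estimates. As a rough illustration of the general idea only, the sketch below down-weights low-confidence video-text pairs in a symmetric contrastive (InfoNCE-style) loss. The function name `confidence_weighted_info_nce`, the `confidence` input, and the temperature value are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def confidence_weighted_info_nce(video_emb, text_emb, confidence, temperature=0.07):
    """Symmetric InfoNCE loss where each pair is weighted by a confidence score.

    Hypothetical sketch: `confidence` holds one score in [0, 1] per pair,
    e.g. an estimate of how trustworthy the generated narrative is.
    """
    # L2-normalise so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature

    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx]  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx]  # text -> video
    per_pair = 0.5 * (loss_v2t + loss_t2v)

    # Weighted average: noisy (low-confidence) pairs contribute less.
    w = np.asarray(confidence, dtype=float)
    return float((w * per_pair).sum() / max(w.sum(), 1e-8))
```

In this sketch a pair with confidence 0 contributes nothing as a positive but still acts as a negative for the other pairs in the batch; whether the actual method behaves this way is not stated in the summary.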
Findings
SurgLIME achieves competitive zero-shot cross-modal alignment.
The framework maintains robust linear probing performance.
Public dataset, code, and models are available at the provided GitHub link.
Abstract
Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we introduce LIME, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. While LIME offers immense scalability, unverified generated texts may contain errors, including hallucinations, that could catastrophically degrade pre-trained medical priors in standard contrastive pipelines. To mitigate this, we propose SurgLIME, a parameter-efficient Vision-Language Pre-training (VLP) framework designed to learn reliable cross-modal alignments…
