Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?

Chengan Che; Chao Wang; Jiayuan Huang; Xinyue Chen; Luis C. Garcia-Peraza-Herrera

arXiv:2604.18134 · cs.CV · May 5, 2026


TL;DR

This paper introduces SurgLIME, a framework for surgical vision-language pre-training that learns cross-modal alignment from noisy LLM-generated narratives without degrading pre-trained visual priors.

Contribution

The paper presents LIME, a scalable surgical video dataset with LLM-generated annotations, and SurgLIME, a parameter-efficient vision-language pre-training (VLP) method that mitigates annotation noise through confidence estimation and preserves pre-trained medical priors.

Findings

01. SurgLIME achieves competitive zero-shot cross-modal alignment.

02. The framework maintains robust linear probing performance.

03. The dataset, code, and models are publicly available via the visurg-ai/SurgLIME GitHub repository.

Abstract

Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we introduce LIME, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. While LIME offers immense scalability, unverified generated texts may contain errors, including hallucinations, that could catastrophically degrade pre-trained medical priors in standard contrastive pipelines. To mitigate this, we propose SurgLIME, a parameter-efficient Vision-Language Pre-training (VLP) framework designed to learn reliable cross-modal alignments…
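
The abstract is cut off before it describes how SurgLIME handles noisy text, so the following is not the authors' implementation. As a rough illustration of why a per-pair confidence estimate could help in a standard contrastive pipeline, here is a minimal sketch, assuming a CLIP-style symmetric InfoNCE loss in which each video-text pair's contribution is scaled by a confidence score in [0, 1]. The function name `confidence_weighted_clip_loss`, the `confidence` input, the toy embeddings, and the temperature value are all hypothetical choices made for this example.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def confidence_weighted_clip_loss(video_emb, text_emb, confidence, temperature=0.07):
    """Symmetric InfoNCE loss with a per-pair confidence weight.

    ASSUMPTION: this weighting scheme is illustrative only; it is not taken
    from the paper. Pair i is (video_emb[i], text_emb[i]), and confidence[i]
    in [0, 1] says how much that pair's caption is trusted.
    """
    n = len(video_emb)
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]

    # logits[i][j] = cosine similarity between video i and text j, over temperature
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    total_weight = sum(confidence)
    if total_weight == 0:
        return 0.0  # no trusted pairs, so nothing to learn from this batch

    loss = 0.0
    for i in range(n):
        video_to_text = -_log_softmax(logits[i])[i]
        text_to_video = -_log_softmax([logits[j][i] for j in range(n)])[i]
        loss += confidence[i] * 0.5 * (video_to_text + text_to_video)
    return loss / total_weight


if __name__ == "__main__":
    # Toy batch: the third caption points away from its video (a stand-in
    # for a hallucinated narrative).
    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    print("uniform :", confidence_weighted_clip_loss(video, text, [1.0, 1.0, 1.0]))
    print("weighted:", confidence_weighted_clip_loss(video, text, [1.0, 1.0, 0.0]))
```

In this toy batch the mismatched third pair dominates the uniformly weighted loss, while setting its confidence to zero removes its direct contribution. How such confidences would be estimated in practice is left open here.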


Code & Models

Repositories

visurg-ai/SurgLIME (GitHub)
