Scaling Video Pretraining for Surgical Foundation Models
Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu

TL;DR
This paper introduces SurgRec, a scalable and reproducible pretraining framework for surgical video understanding that pairs a large, multi-source video corpus with a standardized benchmark to improve model performance.
Contribution
The authors present SurgRec, a pretraining recipe with two variants (SurgRec-MAE and SurgRec-JEPA), curate a multi-source surgical video corpus of 10,535 videos and 214.5M frames, and standardize an evaluation pipeline across 16 downstream datasets.
Findings
SurgRec outperforms self-supervised learning (SSL) baselines and vision-language models (VLMs) across the downstream datasets.
VLMs are less reliable for fine-grained temporal recognition in surgical videos.
The standardized benchmark enables consistent evaluation across diverse surgical domains.
Abstract
Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior…
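The abstract mentions balanced sampling over a multi-source corpus but does not spell out the scheme here. As a minimal sketch of one common interpretation (an assumption, not the authors' method), the snippet below picks a domain uniformly at random and then a video within that domain, so small domains are not swamped by large ones. The function name `balanced_sample`, the toy corpus sizes, and the video ids are all hypothetical.

```python
import random
from collections import Counter


def balanced_sample(videos_by_domain, n, seed=0):
    """Draw n (domain, video_id) pairs.

    Assumed scheme: choose the domain uniformly first, then choose a
    video uniformly inside that domain (sampling with replacement).
    """
    rng = random.Random(seed)
    domains = sorted(videos_by_domain)  # fixed order for reproducibility
    picks = []
    for _ in range(n):
        domain = rng.choice(domains)
        picks.append((domain, rng.choice(videos_by_domain[domain])))
    return picks


# Hypothetical, deliberately imbalanced toy corpus (sizes are made up).
corpus = {
    "endoscopy": [f"endo_{i}" for i in range(1000)],
    "laparoscopy": [f"lap_{i}" for i in range(100)],
    "cataract": [f"cat_{i}" for i in range(50)],
    "robotic": [f"rob_{i}" for i in range(10)],
}

samples = balanced_sample(corpus, 4000)
counts = Counter(domain for domain, _ in samples)
print(counts)  # each domain should appear roughly n / 4 times
```

Sampling uniformly over videos instead would draw about 86% of the picks from the largest toy domain; the two-stage draw trades that for heavier reuse of videos in the smallest domain.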
