Scaling Video Pretraining for Surgical Foundation Models

Sicheng Lu; Zikai Xiao; Jianhui Wei; Danyu Sun; Qi Lu; Keli Hu; Yang Feng; Jian Wu; Zongxin Yang; Zuozhu Liu

arXiv:2603.29966 · cs.CV · April 3, 2026

TL;DR

This paper introduces SurgRec, a scalable and reproducible pretraining framework for surgical video understanding. It pairs a large, diverse multi-source video corpus with a standardized benchmark and reports improved performance over existing baselines.

Contribution

The authors present SurgRec, a pretraining recipe with two variants (SurgRec-MAE and SurgRec-JEPA), and establish a large multi-source surgical video dataset along with a standardized evaluation pipeline.

Findings

01. SurgRec outperforms self-supervised learning (SSL) baselines and vision-language models (VLMs) on multiple datasets.

02. VLMs are less reliable for fine-grained temporal recognition in surgical videos.

03. The standardized benchmark enables consistent evaluation across diverse surgical domains.

Abstract

Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior…
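
The abstract mentions balanced sampling over a multi-source corpus but does not spell out the scheme. As a rough illustration only, the sketch below shows one common way to balance domains: pick a domain uniformly at random, then pick a video uniformly within it, so small domains are not drowned out by large ones. The function name `balanced_sample`, the toy corpus, and its per-domain counts are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def balanced_sample(videos, num_samples, seed=0):
    """Draw (video_id, domain) pairs so each domain is equally likely.

    `videos` is a list of (video_id, domain) pairs. This is an assumed,
    illustrative scheme, not the sampling procedure used by SurgRec.
    """
    rng = random.Random(seed)

    # Group video ids by their domain label.
    by_domain = {}
    for video_id, domain in videos:
        by_domain.setdefault(domain, []).append(video_id)
    domains = sorted(by_domain)

    picks = []
    for _ in range(num_samples):
        domain = rng.choice(domains)              # uniform over domains
        video_id = rng.choice(by_domain[domain])  # uniform within the domain
        picks.append((video_id, domain))
    return picks


# Toy corpus with deliberately imbalanced, made-up counts.
toy = (
    [(f"lap_{i}", "laparoscopy") for i in range(900)]
    + [(f"endo_{i}", "endoscopy") for i in range(60)]
    + [(f"cat_{i}", "cataract") for i in range(30)]
    + [(f"rob_{i}", "robotic") for i in range(10)]
)

counts = Counter(domain for _, domain in balanced_sample(toy, 4000))
print(counts)
```

With this scheme each of the four toy domains receives roughly a quarter of the draws even though laparoscopy makes up 90% of the toy corpus. The trade-off is that videos from small domains are repeated far more often, which a real pipeline might temper with a softer weighting.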
