Scaling up self-supervised learning for improved surgical foundation models

Tim J.M. Jaspers; Ronald L.P.D. de Jong; Yiping Li; Carolus H.J. Kusters; Franciscus H.A. Bakker; Romy C. van Jaarsveld; Gino M. Kuiper; Richard van Hillegersberg; Jelle P. Ruurda; Willem M. Brinkman; Josien P.W. Pluim; Peter H.N. de With; Marcel Breeuwer; Yasmina Al Khalil; Fons van der Sommen

arXiv:2501.09436 · cs.CV · November 26, 2025

TL;DR

This paper introduces SurgeNetXL, a large-scale surgical foundation model trained on over 4.7 million video frames, achieving state-of-the-art performance across multiple surgical vision tasks and providing insights into scaling datasets and model architectures.

Contribution

The study presents SurgeNetXL, a surgical foundation model trained on the largest reported surgical dataset to date (over 4.7 million video frames). It sets new benchmarks and offers insights into scaling pretraining for surgical computer vision.

Findings

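01. SurgeNetXL outperforms the best-performing prior surgical foundation models by a mean of 2.4% (semantic segmentation), 9.0% (phase recognition), and 12.6% (CVS classification).

02. SurgeNetXL surpasses ImageNet-based variants by 1.6-14.4%.

03. Scaling up the pretraining dataset and the training duration improves model performance.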

Abstract

Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS…

Code & Models

Repositories

timjaspers0801/surgenet
PyTorch · Official

Datasets

TimJaspersTue/SurgeNetYoutube
dataset · 212 downloads
