Large-scale Self-supervised Video Foundation Model for Intelligent Surgery

Shu Yang; Fengtao Zhou; Leon Mayer; Fuxiang Huang; Yiliang Chen; Yihui Wang; Sunan He; Yuxiang Nie; Xi Wang; Ömer Sümer; Yueming Jin; Huihui Sun; Shuchang Xu; Alex Qinyang Liu; Zheng Li; Jing Qin; Jeremy YuenChun Teoh; Lena Maier-Hein; Hao Chen

arXiv:2506.02692 · cs.CV · June 4, 2025

Open Access

TL;DR

This paper introduces SurgVISTA, a large-scale self-supervised video pre-training framework for surgical scene understanding, capturing spatiotemporal features from extensive surgical videos to improve AI-assisted surgery.

Contribution

The work presents the first video-level surgical pre-training framework for joint spatiotemporal representation learning, combining a large-scale surgical video dataset with a reconstruction-based objective and knowledge distillation.
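As a rough illustration of how a reconstruction objective and a knowledge-distillation term can be combined into one pre-training loss, here is a minimal sketch. It is not the paper's implementation: the token representation, the mean-squared-error losses, the mask ratio, and the `distill_weight` parameter are all assumptions made for the example, and every function name is hypothetical.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def sample_mask(num_tokens, mask_ratio, rng):
    """Choose which video-token indices are hidden (assumed random masking)."""
    num_masked = int(round(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), num_masked))


def pretraining_loss(target_tokens, predicted_tokens,
                     student_feats, teacher_feats,
                     masked_idx, distill_weight=1.0):
    """Combine reconstruction and distillation terms (illustrative only).

    recon:   average MSE over the masked tokens only.
    distill: average MSE between student features and frozen teacher features.
    """
    recon = sum(mse(predicted_tokens[i], target_tokens[i])
                for i in masked_idx) / len(masked_idx)
    distill = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    distill /= len(student_feats)
    return recon + distill_weight * distill, recon, distill


if __name__ == "__main__":
    # Toy data: 4 tokens of dimension 2; tokens 1 and 3 are masked.
    targets = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    preds = [[0.0, 0.0], [1.0, 2.0], [2.0, 2.0], [3.0, 5.0]]
    student = [[1.0, 0.0]]
    teacher = [[0.0, 0.0]]
    total, recon, distill = pretraining_loss(
        targets, preds, student, teacher, masked_idx=[1, 3], distill_weight=0.5)
    print(total, recon, distill)
```

In this sketch only the masked tokens contribute to the reconstruction term, while the distillation term pulls the student's features toward those of a fixed teacher; how the actual framework weights or defines these terms is not stated in this summary.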

Findings

1. SurgVISTA outperforms existing models on multiple surgical video benchmarks.

2. The large-scale dataset enables robust spatiotemporal feature learning.

3. Knowledge distillation enhances fine-grained anatomical understanding.

Abstract

Computer-Assisted Intervention (CAI) has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making, improving procedural efficacy, and ensuring intraoperative safety. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3,650 videos and approximately 3.55 million frames, spanning more than 20 surgical…

Taxonomy

Topics: Medical Imaging and Analysis