General surgery vision transformer: A video pre-trained foundation model for general surgery
Samuel Schmidgall, Ji Woong Kim, Jeffrey Jopling, Axel Krieger

TL;DR
This paper introduces a large open-source dataset of 680 hours of general surgery videos and a video pre-training method for a vision transformer, and demonstrates improved surgical phase annotation performance on Cholec80.
Contribution
It provides the largest open dataset of general surgery videos to date (680 hours across 28 procedures), a pre-training technique based on forward video prediction that yields a vision transformer able to run in real time, and fine-tuned models for 10 specific procedures.
Findings
GSViT outperforms state-of-the-art single-frame predictors on the Cholec80 phase annotation task.
Open-source dataset enables broader research in surgical AI.
Pre-trained models facilitate real-time surgical video analysis.
Abstract
The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. To address this, (i) we open-source the largest dataset of general surgery videos to date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real time for surgical applications, and we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, showing improved performance over state-of-the-art single-frame predictors.
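As a rough illustration of what forward video prediction means as a training signal, the sketch below pairs each frame with a later frame and scores a predictor by how closely its output matches that later frame. Everything here is an assumption made for illustration: the flattened-list frame representation, the mean-squared-error loss, the `horizon` parameter, and the `copy_last_frame` baseline are hypothetical and are not taken from the GSViT code, whose actual architecture and loss are not described in this summary.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: one frame is a flattened list of pixel values.
Frame = List[float]


def make_forward_pairs(
    frames: Sequence[Frame], horizon: int = 1
) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the frame `horizon` steps later (input, target)."""
    if horizon < 1:
        raise ValueError("horizon must be >= 1")
    return [(frames[t], frames[t + horizon]) for t in range(len(frames) - horizon)]


def mse(pred: Frame, target: Frame) -> float:
    """Mean squared error between two frames (an assumed loss, not the paper's)."""
    if len(pred) != len(target):
        raise ValueError("frames must have the same number of pixels")
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def mean_prediction_loss(
    predict: Callable[[Frame], Frame], frames: Sequence[Frame], horizon: int = 1
) -> float:
    """Average forward-prediction loss of `predict` over one video."""
    pairs = make_forward_pairs(frames, horizon)
    if not pairs:
        raise ValueError("video is too short for the requested horizon")
    return sum(mse(predict(x), y) for x, y in pairs) / len(pairs)


def copy_last_frame(frame: Frame) -> Frame:
    """Trivial baseline: predict that the next frame equals the current one."""
    return list(frame)


# Toy 4-frame "video" with 2 pixels per frame, brightening by 1.0 each step.
video = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
loss = mean_prediction_loss(copy_last_frame, video)
print(loss)
```

In a real pre-training setup the `predict` callable would be a learned model whose parameters are updated to reduce this loss; the copy-the-input baseline is only there to give the loss a reference point.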
Taxonomy
Topics: Surgical Simulation and Training
Methods: Attention Is All You Need · Softmax · Dense Connections · Residual Connection · Linear Layer · Multi-Head Attention · Layer Normalization · Vision Transformer
