General surgery vision transformer: A video pre-trained foundation model for general surgery

Samuel Schmidgall; Ji Woong Kim; Jeffrey Jopling; Axel Krieger

arXiv:2403.05949·cs.CV·April 16, 2024·2 cites

TL;DR

This paper releases a large open-source dataset of 680 hours of general surgery videos, proposes a video pre-training method for a vision transformer based on forward video prediction, and demonstrates improved surgical phase annotation performance on Cholec80.

Contribution

It provides the largest open dataset of general surgery videos to date, a video pre-training technique for surgical vision transformers that can run in real time, and procedure-specific fine-tuned models for 10 procedures.

Findings

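01. GSViT outperforms state-of-the-art single-frame predictors on the Cholec80 phase annotation task.

02. The open-source dataset covers robotic and laparoscopic techniques across 28 procedures, enabling broader research in surgical AI.

03. The released pre-trained and fine-tuned models are intended to support real-time surgical video analysis.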

Abstract

The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. To address this, (i) we open-source the largest dataset of general surgery videos to date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real time for surgical applications, and we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, showing improved performance over state-of-the-art single-frame predictors.
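The forward video prediction objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss function, the frame representation, or the model architecture, so everything below is an assumption made for illustration: frames are flattened lists of floats, the loss is mean squared error, and the names `next_frame_pairs`, `forward_prediction_loss`, and `copy_last_frame` are hypothetical and not taken from the GSViT code. The idea shown is only that each frame serves as input and a later frame serves as the prediction target.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a frame is a flattened list of pixel values.
# Real frames would be image tensors; this keeps the sketch dependency-free.
Frame = List[float]


def next_frame_pairs(clip: Sequence[Frame], step: int = 1) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the frame `step` positions later: (input, target)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return [(clip[i], clip[i + step]) for i in range(len(clip) - step)]


def mse(pred: Frame, target: Frame) -> float:
    """Mean squared error between two frames of equal length."""
    if len(pred) != len(target):
        raise ValueError("frames must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def forward_prediction_loss(
    clip: Sequence[Frame],
    predictor: Callable[[Frame], Frame],
    step: int = 1,
) -> float:
    """Average error of `predictor` when asked to predict future frames.

    Assumption: MSE stands in for whatever loss the authors actually use.
    """
    pairs = next_frame_pairs(clip, step)
    if not pairs:
        raise ValueError("clip is too short for the requested step")
    return sum(mse(predictor(x), y) for x, y in pairs) / len(pairs)


def copy_last_frame(frame: Frame) -> Frame:
    """Trivial baseline: predict that the next frame equals the current one."""
    return list(frame)


if __name__ == "__main__":
    # Toy clip of three 2-pixel frames with steadily increasing brightness.
    clip = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    print(forward_prediction_loss(clip, copy_last_frame))
```

In a pre-training setup of this kind, the `predictor` would be a learned model whose parameters are updated to reduce this loss; the copy-last-frame baseline here only makes the sketch runnable without a deep learning library.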

Code & Models

Repositories

samuelschmidgall/gsvit
PyTorch · Official

Taxonomy

Topics: Surgical Simulation and Training

Methods: Attention Is All You Need · Softmax · Dense Connections · Residual Connection · Linear Layer · Multi-Head Attention · Layer Normalization · Vision Transformer