SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

Jinlin Wu; Felix Holm; Chuxi Chen; An Wang; Yaxin Hu; Xiaofan Ye; Zelin Zang; Miao Xu; Lihua Zhou; Huai Liao; Danny T. M. Chan; Ming Feng; Wai S. Poon; Hongliang Ren; Dong Yi; Nassir Navab; Gaofeng Meng; Jiebo Luo; Hongbin Liu; Zhen Lei

arXiv:2602.05638·cs.CV·April 20, 2026

TL;DR

SurgMotion is a video-native foundation model for surgical videos that learns by predicting motion in latent space rather than reconstructing pixels, and it reports state-of-the-art results across multiple benchmarks.

Contribution

The paper presents SurgMotion, a model built on V-JEPA with three innovations (motion-guided latent masked prediction, spatiotemporal affinity self-distillation, and spatiotemporal feature diversity regularization), together with a large-scale surgical video dataset.

Findings

01

Outperforms prior state-of-the-art methods on surgical workflow recognition, with F1 score improvements of up to 14.6%.

02

Achieves 39.54% mAP-IVT on action triplet recognition.

03

Demonstrates effectiveness on skill assessment, polyp segmentation, and depth estimation.

Abstract

While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details, such as smoke, specular reflections, and fluid motion, rather than semantic structures essential for surgical understanding. We present SurgMotion, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), SurgMotion introduces three key technical innovations tailored to surgical videos: (1) motion-guided latent masked prediction to prioritize semantically meaningful regions, (2) spatiotemporal affinity self-distillation to enforce relational consistency, and (3) spatiotemporal feature diversity regularization (SFDR) to prevent representation collapse in…
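The idea behind innovation (1), biasing the mask toward regions that move, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `motion_guided_mask`, the use of absolute frame differences as a motion proxy, the patch size, and the mask ratio are all assumptions chosen for illustration, and the abstract does not state how SurgMotion actually scores motion.

```python
import numpy as np


def motion_guided_mask(video, patch=4, mask_ratio=0.5, rng=None):
    """Sample a token mask biased toward high-motion patches (hypothetical sketch).

    video: float array of shape (T, H, W), grayscale frames.
    Returns a boolean array of shape (T-1, H//patch, W//patch); True = masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, H, W = video.shape
    gh, gw = H // patch, W // patch

    # Assumption: absolute frame difference serves as a crude motion proxy.
    diff = np.abs(np.diff(video, axis=0))[:, : gh * patch, : gw * patch]

    # Average the motion proxy inside each spatial patch.
    energy = diff.reshape(T - 1, gh, patch, gw, patch).mean(axis=(2, 4))

    # Turn motion energy into sampling probabilities; the small constant
    # keeps static patches selectable so sampling without replacement works.
    scores = energy.ravel() + 1e-6
    probs = scores / scores.sum()

    # Draw the masked tokens without replacement, favoring moving patches.
    n_mask = int(round(mask_ratio * scores.size))
    idx = rng.choice(scores.size, size=n_mask, replace=False, p=probs)

    mask = np.zeros(scores.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(energy.shape)
```

In a JEPA-style setup, a predictor would then be trained to predict the latent features of the masked tokens from the visible ones, so a mask concentrated on moving instruments and tissue spends the prediction objective on those regions rather than on static background.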
