Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning

Mengya Xu; Daiyun Shen; Jie Zhang; Hon Chi Yip; Yujia Gao; Cheng Chen; Dillan Imans; Yonghao Long; Yiru Ye; Yixiao Liu; Rongyun Mai; Kai Chen; Hongliang Ren; Yutong Ban; Guangsuo Wang; Francis Wong; Chi-Fai Ng; Kee Yuan Ngiam; Russell H. Taylor; Daguang Xu; Yueming Jin; Qi Dou

arXiv:2603.12787 · cs.CV · March 16, 2026

TL;DR

This paper introduces a large dataset and a foundation model for recognizing basic surgical actions across specialties, enabling improved skill assessment and surgical planning through vision-language models.

Contribution

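It presents the largest basic surgical action (BSA) dataset to date (10 basic actions across 6 surgical specialties, over 11,000 video clips) and a new foundation model capable of cross-specialty recognition and downstream surgical applications.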

Findings

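01 Robust cross-specialty recognition of basic surgical actions, validated on datasets covering different procedure types and body parts.

02 Surgical skill assessment in prostatectomy demonstrated as a downstream application.

03 Action planning supported by vision-language models.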

Abstract

Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling of basic surgical actions (BSA), the fundamental unit of operation in any surgery, is important to drive the evolution of this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with over 11,000 video clips, which is the largest to date. Based on the BSA dataset, we developed a new foundation model that conducts general-purpose recognition of basic actions. Our approach demonstrates robust cross-specialist performance in experiments validated on datasets from different procedural types and various body parts. Furthermore, we demonstrate downstream applications enabled by the BSA foundation model through surgical skill assessment in prostatectomy using domain-specific…
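
To make the idea of cross-specialty evaluation concrete, the following is a minimal sketch of how per-specialty accuracy could be computed for a clip-level action recognizer. It is not the authors' evaluation code. The `Clip` record, the `per_specialty_accuracy` helper, and the placeholder labels (`specialty_A`, `action_0`, and so on) are all hypothetical; the abstract does not list the actual action or specialty names, nor the metric used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one video clip (not the paper's schema)."""
    clip_id: str
    specialty: str  # placeholder specialty name
    action: str     # placeholder ground-truth action label


def per_specialty_accuracy(clips, predictions):
    """Return {specialty: accuracy} given predictions keyed by clip_id.

    A clip with no entry in `predictions` counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for clip in clips:
        total[clip.specialty] += 1
        if predictions.get(clip.clip_id) == clip.action:
            correct[clip.specialty] += 1
    return {specialty: correct[specialty] / total[specialty] for specialty in total}


# Toy data with placeholder labels, for illustration only.
clips = [
    Clip("c1", "specialty_A", "action_0"),
    Clip("c2", "specialty_A", "action_1"),
    Clip("c3", "specialty_B", "action_0"),
    Clip("c4", "specialty_B", "action_2"),
]
predictions = {"c1": "action_0", "c2": "action_1", "c3": "action_0", "c4": "action_1"}

accuracy = per_specialty_accuracy(clips, predictions)
print(accuracy)
```

Reporting accuracy per specialty, rather than one pooled number, is one simple way to check that a recognizer does not perform well only on the specialties that dominate the training data.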

Taxonomy

Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Soft Robotics and Applications