SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning

Mengya Xu; Zhongzhen Huang; Dillan Imans; Yiru Ye; Xiaofan Zhang; Qi Dou

arXiv:2506.07196 · cs.CV · June 16, 2025

TL;DR

This paper introduces SAP-Bench, a comprehensive dataset and benchmark for evaluating multimodal large language models in surgical action planning, emphasizing interpretability and safety in life-critical surgical decision-making.

Contribution

It presents SAP-Bench, a large-scale dataset with annotated surgical actions, and evaluates current models, highlighting gaps and proposing a framework for future research in surgical AI.

Findings

01. Current models show significant performance gaps in next action prediction.

02. The dataset enables detailed analysis of model capabilities in surgical contexts.

03. Evaluation reveals critical challenges in applying MLLMs to surgical planning.

Abstract

Effective evaluation is critical for driving advancements in multimodal large language model (MLLM) research. The surgical action planning (SAP) task, which aims to generate future action sequences from visual inputs, demands precise and sophisticated analytical capabilities. Unlike mathematical reasoning, surgical decision-making operates in life-critical domains and requires meticulous, verifiable processes to ensure reliability and patient safety. This task demands the ability to distinguish between atomic visual actions and coordinate complex, long-horizon procedures, capabilities that are inadequately evaluated by current benchmarks. To address this gap, we introduce SAP-Bench, a large-scale, high-quality dataset designed to enable MLLMs to perform interpretable surgical action planning. Our SAP-Bench benchmark, derived from the cholecystectomy procedures context with the mean…

Taxonomy

Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education