SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning
Mengya Xu, Zhongzhen Huang, Dillan Imans, Yiru Ye, Xiaofan Zhang, Qi Dou

TL;DR
This paper introduces SAP-Bench, a comprehensive dataset and benchmark for evaluating multimodal large language models in surgical action planning, emphasizing interpretability and safety in life-critical surgical decision-making.
Contribution
It presents SAP-Bench, a large-scale dataset with annotated surgical actions, and evaluates current models, highlighting gaps and proposing a framework for future research in surgical AI.
Findings
Current models show significant performance gaps in next action prediction.
The dataset enables detailed analysis of model capabilities in surgical contexts.
Evaluation reveals critical challenges in applying MLLMs to surgical planning.
Abstract
Effective evaluation is critical for driving advancements in multimodal large language model (MLLM) research. The surgical action planning (SAP) task, which aims to generate future action sequences from visual inputs, demands precise and sophisticated analytical capabilities. Unlike mathematical reasoning, surgical decision-making operates in life-critical domains and requires meticulous, verifiable processes to ensure reliability and patient safety. This task demands the ability to distinguish between atomic visual actions and coordinate complex, long-horizon procedures, capabilities that are inadequately evaluated by current benchmarks. To address this gap, we introduce SAP-Bench, a large-scale, high-quality dataset designed to enable MLLMs to perform interpretable surgical action planning. Our SAP-Bench benchmark, derived from the cholecystectomy procedures context with the mean…
Taxonomy
Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
