NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification

Ziyang Song; Zelin Zang; Xiaofan Ye; Boqiang Xu; Long Bai; Jinlin Wu; Hongliang Ren; Hongbin Liu; Jiebo Luo; Zhen Lei

arXiv:2512.06921 · cs.CV · December 9, 2025

Open Access

TL;DR

NeuroABench is a multimodal benchmark for evaluating anatomical understanding in neurosurgery. The best model tested reaches 40.87% accuracy in anatomical identification, below the up to 56% achieved by neurosurgical trainees.

Contribution

This paper introduces NeuroABench, described as the first multimodal benchmark created explicitly to evaluate anatomical understanding in neurosurgery. It comprises 9 hours of annotated neurosurgical videos covering 89 distinct procedures, together with a standardized evaluation framework.

Findings

01. Best MLLM achieves 40.87% accuracy in anatomical identification.

02. Neurosurgical trainees reach up to 56% accuracy, outperforming models.

03. Significant performance gap remains between models and human experts.

Abstract

Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is…

Taxonomy

Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education