Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons
Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Jin Vivian Lee, Daniel Alexander Alber, Karl L. Sangwon, Douglas Kondziolka, Eric Karl Oermann

TL;DR
This study evaluates 28 large language models on neurosurgical board-style exam questions. Although some models pass, their accuracy drops substantially when distracting information is added to the questions, highlighting the need for improved robustness.
Contribution
The paper introduces a distraction framework to assess LLM fragility and reports performance data for 28 models on 2,904 neurosurgical questions, exposing vulnerabilities in current models.
Findings
6 of the 28 evaluated models passed the neurosurgery exam
Distractions reduced accuracy by up to 20.4%
Proprietary models were more robust than open-source ones
Abstract
The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in…
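The distraction framework described above can be illustrated with a minimal sketch: append an irrelevant sentence to each question stem, re-score the model, and compare accuracy before and after. Everything in this example is an assumption made for illustration. The item format, the function names (`add_distractor`, `accuracy`, `accuracy_drop`), the placeholder questions, and the keyword-matching `toy_model` are hypothetical; they are not the authors' code, not CNS-SANS content, and not a real LLM.

```python
from typing import Callable, Sequence

# Hypothetical item format: (question stem, answer choices, index of correct choice).
Item = tuple[str, Sequence[str], int]
Model = Callable[[str, Sequence[str]], int]


def add_distractor(stem: str, distractor: str) -> str:
    """Append one irrelevant sentence to the question stem."""
    return f"{stem} {distractor}"


def accuracy(model: Model, items: Sequence[Item]) -> float:
    """Fraction of items for which the model returns the correct choice index."""
    correct = sum(model(stem, choices) == answer for stem, choices, answer in items)
    return correct / len(items)


def accuracy_drop(model: Model, items: Sequence[Item], distractor: str):
    """Return (baseline accuracy, distracted accuracy, absolute drop)."""
    baseline = accuracy(model, items)
    distracted_items = [(add_distractor(s, distractor), c, a) for s, c, a in items]
    distracted = accuracy(model, distracted_items)
    return baseline, distracted, baseline - distracted


def toy_model(stem: str, choices: Sequence[str]) -> int:
    """Stand-in for an LLM (assumption): picks the first choice whose text
    appears in the stem, otherwise choice 0. It is deliberately fragile."""
    for i, choice in enumerate(choices):
        if choice in stem:
            return i
    return 0


# Placeholder items, not real exam questions.
items = [
    ("Placeholder stem one.", ("alpha", "beta"), 0),
    ("Placeholder stem two.", ("beta", "alpha"), 0),
]

baseline, distracted, drop = accuracy_drop(
    toy_model, items, "An unrelated note mentions beta."
)
print(f"baseline={baseline:.2f} distracted={distracted:.2f} drop={drop:.2f}")
```

The drop returned here is an absolute difference in accuracy. Whether the figures reported in the paper are absolute or relative differences is not stated in this summary, so the sketch should not be read as reproducing the study's metric.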
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Surgical Simulation and Training · Diversity and Career in Medicine
