Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons

Krithik Vishwanath; Anton Alyakin; Mrigayu Ghosh; Jin Vivian Lee; Daniel Alexander Alber; Karl L. Sangwon; Douglas Kondziolka; Eric Karl Oermann

arXiv:2505.23477 · cs.CL · May 30, 2025

Open Access

TL;DR

This study evaluates 28 large language models on 2,904 neurosurgery board-style questions derived from the CNS-SANS. Six models pass, but accuracy drops by up to 20.4% when simple, irrelevant distractor statements are added, which highlights the need for improved robustness.

Contribution

The paper introduces a distraction framework to assess LLM fragility and provides comprehensive performance data on neurosurgical questions, emphasizing vulnerabilities in current models.

Findings

01. Six of the 28 evaluated models achieved passing scores on the neurosurgery board-style questions

02. Distractor statements reduced accuracy by up to 20.4%

03. Proprietary models were more robust to distraction than open-source ones

Abstract

The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in…
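The distraction framework described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the `Question` type, the placement of the distractor after the question stem, the placeholder questions, the toy model, and the choice to report the drop in percentage points are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one multiple-choice question."""
    stem: str
    choices: tuple[str, ...]
    answer: int  # index of the correct choice


# A "model" here is any callable that returns the index of its chosen answer.
Model = Callable[[Question], int]


def with_distractor(question: Question, distractor: str) -> Question:
    """Return a copy of the question with an irrelevant sentence appended.

    Appending after the stem is an assumption; the source does not say
    where distractors are placed.
    """
    return replace(question, stem=f"{question.stem} {distractor}")


def accuracy(model: Model, questions: Sequence[Question]) -> float:
    """Fraction of questions the model answers correctly."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(model(q) == q.answer for q in questions)
    return correct / len(questions)


def accuracy_drop(model: Model, questions: Sequence[Question], distractor: str) -> float:
    """Baseline accuracy minus distracted accuracy, in percentage points."""
    distracted = [with_distractor(q, distractor) for q in questions]
    return 100.0 * (accuracy(model, questions) - accuracy(model, distracted))


# Placeholder questions and a distractor built around a polysemous word
# ("stroke" used in a non-clinical sense). None of this is real exam content.
QUESTIONS = [
    Question("Placeholder question 1.", ("A", "B", "C", "D"), 0),
    Question("Placeholder question 2.", ("A", "B", "C", "D"), 2),
]
DISTRACTOR = "The painter finished the canvas with a single brush stroke."


def toy_model(question: Question) -> int:
    """Stand-in model: correct unless the distractor phrase is present."""
    if "brush stroke" in question.stem:
        return (question.answer + 1) % len(question.choices)
    return question.answer


if __name__ == "__main__":
    print(f"baseline accuracy: {accuracy(toy_model, QUESTIONS):.2f}")
    print(f"drop (percentage points): {accuracy_drop(toy_model, QUESTIONS, DISTRACTOR):.1f}")
```

In a real evaluation, `toy_model` would be replaced by a call to the language model under test, and the drop would be computed per model over the full question set.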

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Surgical Simulation and Training · Diversity and Career in Medicine