The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making
Abinitha Gourabathina, Yuexing Hao, Walter Gerych, Marzyeh Ghassemi

TL;DR
This paper introduces MedPerturb, a dataset for evaluating medical LLM robustness to clinical input variability, revealing differences in how humans and models respond to gender, style, and format perturbations.
Contribution
The paper presents MedPerturb, a novel dataset with controlled perturbations to assess medical LLMs' robustness and compare their decision-making to humans in clinical scenarios.
Findings
LLMs are more sensitive than human experts to gender and style perturbations.
Human experts are more sensitive than LLMs to format changes such as LLM-generated summaries.
Evaluation beyond static benchmarks is necessary for clinical safety.
Abstract
Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use…
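The three perturbation axes described in the abstract can be pictured as a small record type. The sketch below is illustrative only and is not the released MedPerturb schema: the names `Axis`, `PERTURBATIONS`, `ClinicalContext`, `axis_of`, and every field name are assumptions, and the perturbation labels are just the examples the abstract mentions.

```python
from dataclasses import dataclass
from enum import Enum


class Axis(Enum):
    """The three perturbation axes named in the abstract."""
    GENDER = "gender"
    STYLE = "style"
    FORMAT = "format"


# Example perturbations from the abstract, grouped by axis.
# The label strings are hypothetical; the real dataset may name them differently.
PERTURBATIONS = {
    Axis.GENDER: ("gender-swapped", "gender-removed"),
    Axis.STYLE: ("uncertain", "colloquial"),
    Axis.FORMAT: ("multi-turn", "summary"),
}


def axis_of(perturbation: str) -> Axis:
    """Return the axis that a perturbation label belongs to."""
    for axis, labels in PERTURBATIONS.items():
        if perturbation in labels:
            return axis
    raise ValueError(f"unknown perturbation: {perturbation!r}")


@dataclass(frozen=True)
class ClinicalContext:
    """One perturbed clinical vignette (all field names are assumptions)."""
    vignette_id: str   # identifier of the source vignette
    perturbation: str  # one of the labels in PERTURBATIONS
    text: str          # the perturbed clinical input

    @property
    def axis(self) -> Axis:
        return axis_of(self.perturbation)


# Usage: tag a perturbed vignette and recover its axis.
ctx = ClinicalContext("v001", "summary", "Summary of the patient's history ...")
print(ctx.axis)
```

Deriving the axis from the perturbation label, rather than storing both, keeps a record from pairing a label with the wrong axis.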
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Mental Health via Writing
