Open (Clinical) LLMs are Sensitive to Instruction Phrasings
Alberto Mario Ceballos Arroyo, Monica Munnangi, Jiuding Sun, Karen Y.C. Zhang, Denis Jered McInerney, Byron C. Wallace, Silvio Amir

TL;DR
This study investigates how sensitive instruction-tuned Large Language Models (LLMs) used in healthcare are to variations in instruction phrasing. It finds substantial fluctuations in both performance and fairness, especially in domain-specific models.
Contribution
The paper provides a systematic evaluation of how robust general and clinical LLMs are to natural variations in instruction phrasing, and highlights unexpected brittleness in domain-specific models.
Findings
Performance varies significantly with instruction phrasing.
Domain-specific models are more brittle than general models.
Instruction phrasing impacts fairness across demographic groups.
Abstract
Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data…
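One way to make "sensitivity to instruction phrasings" concrete is to score the same model on the same task under several phrasings and summarize the spread. The snippet below is a minimal sketch under that assumption, not the authors' evaluation code: the function name `phrasing_sensitivity`, the choice of range and standard deviation as summary statistics, and the example scores are all hypothetical placeholders and do not come from the paper.

```python
from statistics import mean, pstdev


def phrasing_sensitivity(scores_by_phrasing):
    """Summarize how much a task metric moves across instruction phrasings.

    scores_by_phrasing maps a phrasing identifier to the metric value
    (e.g. accuracy or F1) the model obtained with that phrasing.
    """
    if not scores_by_phrasing:
        raise ValueError("need at least one phrasing score")
    values = list(scores_by_phrasing.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),               # population std across phrasings
        "range": max(values) - min(values),  # best-case minus worst-case
        "best": max(scores_by_phrasing, key=scores_by_phrasing.get),
        "worst": min(scores_by_phrasing, key=scores_by_phrasing.get),
    }


# Hypothetical placeholder scores for illustration only (not paper results).
example_scores = {
    "phrasing_a": 0.80,
    "phrasing_b": 0.70,
    "phrasing_c": 0.60,
}
summary = phrasing_sensitivity(example_scores)
print(summary)
```

A larger range or standard deviation for one model than for another, on the same task and the same set of phrasings, would indicate that the first model is more brittle in the sense discussed in the abstract.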
Taxonomy
Topics: Natural Language Processing Techniques
