Open (Clinical) LLMs are Sensitive to Instruction Phrasings

Alberto Mario Ceballos Arroyo; Monica Munnangi; Jiuding Sun; Karen Y.C. Zhang; Denis Jered McInerney; Byron C. Wallace; Silvio Amir

arXiv:2407.09429 · cs.CL · July 15, 2024

TL;DR

This study examines how sensitive instruction-tuned Large Language Models (LLMs) used in healthcare are to variations in instruction phrasing, finding substantial fluctuations in both performance and fairness, especially for domain-specific models.

Contribution

The paper provides a systematic evaluation of how robust various clinical LLMs are to natural variations in instruction phrasing, highlighting unexpected brittleness in domain-specific models.

Findings

01

Performance varies significantly with instruction phrasing.

02

Domain-specific models are more brittle than general models.

03

Instruction phrasing impacts fairness across demographic groups.
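One way to make finding 03 concrete is to score each instruction phrasing separately for each demographic group and then check how the between-group gap moves as the phrasing changes. The sketch below assumes per-example correctness labels and a single binary group attribute; the function `fairness_gaps`, the record schema, and the toy data are hypothetical illustrations, not the paper's method, metric, or results.

```python
from collections import defaultdict
from statistics import mean


def fairness_gaps(records):
    """Return {phrasing_id: accuracy(group "A") - accuracy(group "B")}.

    `records` is an iterable of (phrasing_id, group, correct) tuples, where
    `group` is "A" or "B" and `correct` is a bool (hypothetical schema).
    Assumes every phrasing has at least one example from each group;
    otherwise statistics.mean raises StatisticsError on an empty list.
    """
    buckets = defaultdict(list)
    for phrasing, group, correct in records:
        buckets[(phrasing, group)].append(1.0 if correct else 0.0)
    phrasings = sorted({p for p, _ in buckets})
    return {p: mean(buckets[(p, "A")]) - mean(buckets[(p, "B")]) for p in phrasings}


# Toy data for illustration only: two phrasings of the same instruction.
TOY_RECORDS = [
    ("p1", "A", True), ("p1", "A", True), ("p1", "B", True), ("p1", "B", False),
    ("p2", "A", True), ("p2", "A", False), ("p2", "B", True), ("p2", "B", True),
]

if __name__ == "__main__":
    gaps = fairness_gaps(TOY_RECORDS)
    for phrasing, gap in gaps.items():
        print(f"{phrasing}: gap = {gap:+.2f}")
    # How far the gap itself moves when only the wording changes.
    print(f"gap range across phrasings: {max(gaps.values()) - min(gaps.values()):.2f}")
```

In this toy example the sign of the gap flips between the two phrasings, which is the kind of effect a per-phrasing breakdown can surface and a single averaged score would hide.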

Abstract

Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data…
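A simple way to summarize "sensitivity to instruction phrasing" is the spread of a task metric across different phrasings of the same instruction, for example the range (best minus worst phrasing) and the standard deviation. The sketch below computes both per model; the helper `phrasing_sensitivity`, the model names, and the scores are hypothetical placeholders, not the seven models, metrics, or numbers reported in the paper.

```python
from statistics import pstdev


def phrasing_sensitivity(scores_by_model):
    """Map each model name to (range, population std) of its per-phrasing scores."""
    summary = {}
    for model, scores in scores_by_model.items():
        if not scores:
            raise ValueError(f"no scores for {model}")
        summary[model] = (max(scores) - min(scores), pstdev(scores))
    return summary


# Illustrative scores (e.g., one metric value per instruction phrasing); not from the paper.
TOY_SCORES = {
    "general-model": [0.70, 0.72, 0.68, 0.70],
    "clinical-model": [0.75, 0.55, 0.62, 0.68],
}

if __name__ == "__main__":
    summary = phrasing_sensitivity(TOY_SCORES)
    # Rank models from most to least brittle by the best-minus-worst range.
    for model, (rng, sd) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
        print(f"{model}: range={rng:.2f}, std={sd:.3f}")
```

The range captures the worst case a clinician could hit by wording a request differently, while the standard deviation reflects how spread out typical phrasings are; reporting both avoids letting a single outlier phrasing dominate the comparison.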

Code & Models

Repositories

alceballosa/clin-robust (official)

Videos

Open (Clinical) LLMs are Sensitive to Instruction Phrasings · Underline

Taxonomy

Topics: Natural Language Processing Techniques