Prompt Sensitivity and Answer Consistency of Small Open-Source Language Models for Clinical Question Answering in Low-Resource Healthcare
Shravani Hariprasad

TL;DR
This study evaluates the reliability of small open-source language models for clinical question answering in low-resource healthcare, highlighting the importance of assessing both consistency and accuracy to ensure safe AI deployment.
Contribution
It compares five small open-source models across three clinical question answering datasets and five prompt styles, and examines how consistency, accuracy, and instruction-following relate to one another in clinical AI.
Findings
High consistency does not guarantee correctness.
Llama 3.2 balances accuracy and reliability well.
Domain pretraining alone is insufficient for clinical QA.
Abstract
Small open-source language models are gaining attention for healthcare applications in low-resource settings where cloud infrastructure and GPU hardware may be unavailable. However, the reliability of these models under different phrasings of the same clinical question remains poorly understood. We evaluate five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B, a domain-pretrained model without instruction tuning) across three clinical question answering datasets (MedQA, MedMCQA, and PubMedQA) using five prompt styles: original, formal, simplified, roleplay, and direct. Model behavior is evaluated using consistency scores, accuracy, and instruction-following failure rates. All experiments were conducted locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent across models. Gemma 2 achieved the…
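The abstract names three metrics (consistency scores, accuracy, and instruction-following failure rates) but this listing does not define them. The sketch below shows one plausible way to compute them, under assumptions that are not taken from the paper: consistency is the share of prompt styles that agree with the most common parsed answer for a question, accuracy is computed over all (question, prompt style) pairs, and an instruction-following failure is any response from which no answer option could be parsed (represented here as `None`). The function names and data layout are hypothetical.

```python
from collections import Counter

# Prompt styles named in the abstract; each question yields one answer per style.
PROMPT_STYLES = ["original", "formal", "simplified", "roleplay", "direct"]


def consistency(answers):
    """Share of prompt styles agreeing with the most common parsed answer.

    ASSUMPTION: this definition is illustrative, not the paper's.
    `None` marks a response with no parseable answer option.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return 0.0
    modal_count = Counter(valid).most_common(1)[0][1]
    return modal_count / len(answers)


def evaluate(per_question_answers, gold_labels):
    """Aggregate consistency, accuracy, and failure rate for one model.

    per_question_answers: list of lists, one parsed answer per prompt style.
    gold_labels: list of correct answers, aligned with the questions.
    """
    n_pairs = sum(len(answers) for answers in per_question_answers)
    correct = sum(
        ans == gold
        for answers, gold in zip(per_question_answers, gold_labels)
        for ans in answers
    )
    failures = sum(
        ans is None for answers in per_question_answers for ans in answers
    )
    mean_consistency = sum(
        consistency(answers) for answers in per_question_answers
    ) / len(per_question_answers)
    return {
        "consistency": mean_consistency,
        "accuracy": correct / n_pairs,
        "failure_rate": failures / n_pairs,
    }


# Toy data (invented): question 1 is answered "A" under every style but the
# gold label is "B"; question 2 is mostly right with one unparseable response.
toy_answers = [["A"] * 5, ["C", "C", "C", "C", None]]
toy_gold = ["B", "C"]
print(evaluate(toy_answers, toy_gold))
```

In the toy data, the first question is perfectly consistent yet always wrong, which is the pattern behind the finding that high consistency does not guarantee correctness.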
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Electronic Health Records Systems
