Prompt Sensitivity and Answer Consistency of Small Open-Source Language Models for Clinical Question Answering in Low-Resource Healthcare

Shravani Hariprasad

arXiv:2603.00917·cs.CL·March 18, 2026

TL;DR

This study evaluates the reliability of small open-source language models for clinical question answering in low-resource healthcare, highlighting the importance of assessing both consistency and accuracy to ensure safe AI deployment.

Contribution

It compares five small open-source models across three clinical question answering datasets and five prompt styles, and examines how consistency, accuracy, and instruction-following failures relate to one another in clinical AI.

Findings

01. High consistency does not guarantee correctness.

02. Llama 3.2 balances accuracy and reliability well.

03. Domain pretraining alone is insufficient for clinical QA.

Abstract

Small open-source language models are gaining attention for healthcare applications in low-resource settings where cloud infrastructure and GPU hardware may be unavailable. However, the reliability of these models under different phrasings of the same clinical question remains poorly understood. We evaluate five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B, a domain-pretrained model without instruction tuning) across three clinical question answering datasets (MedQA, MedMCQA, and PubMedQA) using five prompt styles: original, formal, simplified, roleplay, and direct. Model behavior is evaluated using consistency scores, accuracy, and instruction-following failure rates. All experiments were conducted locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent across models. Gemma 2 achieved the…
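
To make the three reported metrics concrete, here is a minimal Python sketch of how they could be computed for a single multiple-choice question. The visible part of the abstract does not define the metrics, so the definitions below are assumptions: consistency is taken as the share of prompt styles that agree with the most common parsed answer, accuracy as the share of styles whose answer matches the gold label, and an instruction-following failure as a response from which no answer option could be parsed (represented as None). The function name evaluate_question and the dictionary layout are hypothetical.

```python
from collections import Counter

# The five prompt styles named in the abstract.
PROMPT_STYLES = ["original", "formal", "simplified", "roleplay", "direct"]


def evaluate_question(answers_by_style, gold):
    """Score one question across all prompt styles.

    answers_by_style maps a style name to the parsed answer option
    (e.g. "A"), or to None when no option could be parsed.
    All three metric definitions here are assumptions, not the paper's.
    """
    answers = [answers_by_style.get(style) for style in PROMPT_STYLES]
    valid = [a for a in answers if a is not None]
    n = len(answers)

    # Assumed consistency: size of the largest agreeing group / number of styles.
    consistency = Counter(valid).most_common(1)[0][1] / n if valid else 0.0
    # Assumed accuracy: fraction of styles whose parsed answer equals the gold label.
    accuracy = sum(a == gold for a in valid) / n
    # Assumed failure rate: fraction of styles with no parseable answer.
    failure_rate = (n - len(valid)) / n

    return {
        "consistency": consistency,
        "accuracy": accuracy,
        "failure_rate": failure_rate,
    }


# A model that answers "B" under every phrasing when the gold label is "C"
# is perfectly consistent and entirely wrong.
always_b = evaluate_question({style: "B" for style in PROMPT_STYLES}, gold="C")
print(always_b)
```

Under these assumed definitions, the example scores a consistency of 1.0 with an accuracy of 0.0, which is one way the two measures can come apart.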

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Electronic Health Records Systems