Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology

Vanessa D'Amario; Randy Daniel; Alessandro Zanetti; Dhruv Edamadaka; Nitya Alaparthy; Joshua Tarkoff

arXiv:2601.11567 · cs.CL · January 21, 2026

Open Access

TL;DR

This study evaluates small open-source medical LLMs in pediatric endocrinology, revealing that high response consistency does not equate to correctness and highlighting issues with reproducibility and diagnostic reliability.

Contribution

It introduces a comprehensive evaluation framework beyond accuracy, assessing consistency, robustness, and reasoning in medical LLMs, and uncovers critical limitations in their clinical deployment.

Findings

01

High consistency does not imply correctness.

02

Prompt variations cause divergent outputs despite stable accuracy.

03

System-level perturbations significantly affect model outputs.
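
As a rough illustration of the first finding, the sketch below computes a consistency score and an accuracy score for repeated samples of a single multiple-choice question. The metric definitions (consistency as the share of samples matching the most frequent answer, accuracy as the share matching the answer key), the function names, and the sample data are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def consistency(answers):
    """Share of sampled answers that match the most frequent answer (assumed definition)."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def accuracy(answers, correct):
    """Share of sampled answers that match the answer key."""
    return sum(a == correct for a in answers) / len(answers)


# Hypothetical data: ten stochastic samples for one MCQ whose key is "C".
samples = ["B"] * 9 + ["C"]
answer_key = "C"

print(f"consistency = {consistency(samples):.2f}")
print(f"accuracy    = {accuracy(samples, answer_key):.2f}")
```

In this made-up case the model gives the same answer in nine of ten runs yet is right only once, so a consistency score alone would overstate reliability.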

Abstract

Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. However, their evaluation is often limited to accuracy on medical multiple-choice question (MCQ) benchmarks and lacks any assessment of consistency, robustness, or reasoning behavior. We use MCQs coupled with human evaluation and clinical review to assess six small open-source medical LLMs (HuatuoGPT-o1 (Chen 2024), Diabetica-7B, Diabetica-o1 (Wei 2024), Meditron3-8B (Sallinen 2025), MedFound-7B (Liu 2025), and ClinicalGPT-base-zh (Wang 2023)) in pediatric endocrinology. In deterministic settings, we examine the effect of prompt variation on the models' output and self-assessment bias. In stochastic settings, we evaluate output variability and investigate the relationship between consistency and correctness. HuatuoGPT-o1-8B achieved the highest performance. The…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Genomics and Rare Diseases · Machine Learning in Healthcare