TL;DR
This paper introduces a practical, open-source framework for evaluating small open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside correctness.
Contribution
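It presents an evaluation pipeline that measures both the quality and the within-model reproducibility of small LLMs in medical QA, computing eight quality metrics (including BERTScore, ROUGE-L, and an LLM-as-judge rubric) and two reproducibility metrics, and highlighting safety gaps in current benchmarks.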
Findings
Self-agreement across runs is at most 0.20, indicating low reproducibility.
Between 87% and 97% of each model's outputs are unique, so repeated queries rarely return the same answer, revealing a safety gap.
MedGemma underperforms larger models on quality and reproducibility.
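To make the two reproducibility figures above concrete, here is a minimal sketch of how such numbers could be computed. The paper's exact definitions are not given in this summary, so everything here is an assumption: the function names (`normalize`, `unique_output_rate`, `self_agreement`) are hypothetical, outputs are compared by exact match after lowercasing and whitespace normalization, "unique" is taken to mean the fraction of distinct outputs, and self-agreement is taken to mean the average fraction of run pairs that return the same answer to the same question.

```python
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def unique_output_rate(outputs: list[str]) -> float:
    """Fraction of distinct outputs among all outputs (assumed definition)."""
    normalized = [normalize(o) for o in outputs]
    return len(set(normalized)) / len(normalized)


def self_agreement(runs: list[list[str]]) -> float:
    """Mean pairwise exact-match agreement across repeated runs.

    runs[r][q] is the answer produced in run r for question q.
    For each question, count the share of run pairs whose normalized
    answers are identical, then average over questions.
    """
    pairs = list(combinations(range(len(runs)), 2))
    n_questions = len(runs[0])
    total = 0.0
    for q in range(n_questions):
        answers = [normalize(run[q]) for run in runs]
        agreeing = sum(answers[i] == answers[j] for i, j in pairs)
        total += agreeing / len(pairs)
    return total / n_questions


# Toy data: three runs of one model over two questions (illustrative only).
runs = [
    ["Take ibuprofen.", "See a doctor."],
    ["take ibuprofen.", "Rest and fluids."],
    ["Take aspirin.", "See a doctor."],
]
all_outputs = [answer for run in runs for answer in run]

print(f"unique output rate: {unique_output_rate(all_outputs):.2f}")
print(f"self-agreement:     {self_agreement(runs):.2f}")
```

Exact match is the strictest possible comparison; a semantic similarity threshold would give higher agreement on paraphrased answers, and the paper may use a different comparison entirely.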
Abstract
Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open-source evaluation framework for assessing small, locally-deployable open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric, together with two within-model reproducibility metrics…
