Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework

Avi-ad Avraam Buskila

arXiv:2604.10535·cs.IR·April 14, 2026

TL;DR

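This paper introduces a practical, open-source framework for evaluating small, locally deployable open-weight LLMs on medical question answering. It treats reproducibility as a first-class metric alongside correctness, on the grounds that a model that answers the same question differently on each run is not a reliable medical tool.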

Contribution

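The paper presents an evaluation pipeline that measures both answer quality (eight metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric) and within-model reproducibility (two metrics) for small LLMs in medical QA, highlighting safety gaps in current benchmarks.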

Findings

01

Self-agreement across runs is at most 0.20, indicating low reproducibility.

02

87-97% of outputs per model are unique, revealing a safety gap.

03

MedGemma underperforms larger models on quality and reproducibility.
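
The two reproducibility figures above can be illustrated with a minimal sketch. The paper's exact definitions are not given in this summary, so the code below assumes the simplest reading: self-agreement is the fraction of run pairs whose whitespace- and case-normalized outputs match exactly, and the unique-output rate is the share of distinct normalized outputs among all runs. The function names `normalize`, `self_agreement`, and `unique_fraction` are hypothetical and are not taken from the paper's repository.

```python
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def self_agreement(outputs):
    """Fraction of unordered run pairs with identical normalized output.

    Assumption: exact-match pairwise agreement; the paper may define
    self-agreement differently (for example, via semantic similarity).
    """
    norm = [normalize(o) for o in outputs]
    pairs = list(combinations(norm, 2))
    if not pairs:
        return 1.0  # a single run trivially agrees with itself
    return sum(a == b for a, b in pairs) / len(pairs)


def unique_fraction(outputs):
    """Share of distinct normalized outputs among all runs."""
    norm = [normalize(o) for o in outputs]
    return len(set(norm)) / len(norm) if norm else 0.0


# Five hypothetical runs of one model on the same question.
runs = [
    "Take ibuprofen with food.",
    "take ibuprofen  with food.",
    "See a doctor.",
    "Rest and hydrate.",
    "See a doctor.",
]

print(self_agreement(runs))   # 2 matching pairs out of 10
print(unique_fraction(runs))  # 3 distinct outputs out of 5
```

Under this reading, a self-agreement of at most 0.20 means that no more than one in five pairs of runs on the same question produce the same answer.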

Abstract

Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open-source evaluation framework for assessing small, locally-deployable open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric, together with two within-model reproducibility metrics…
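
Of the quality metrics named in the abstract, ROUGE-L is simple enough to sketch directly: it scores a candidate answer against a reference by the length of their longest common subsequence (LCS) of tokens. The version below is a minimal illustration assuming lowercase whitespace tokenization and a balanced F1; the helper names `lcs_length` and `rouge_l_f1` are hypothetical, and the paper's pipeline may instead rely on a library implementation with different tokenization or weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (assumed tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences, not data from the paper.
score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))
```

Because ROUGE-L rewards shared word order rather than meaning, two clinically equivalent answers with different phrasing can score low, which is one reason to pair it with semantic metrics such as BERTScore.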

Code & Models

Repositories

aviad-buskila/llm_medical_reproducibility (GitHub)