Quantifying Hallucinations in Language Models on Medical Textbooks

Brandon C. Colelough; Davis Bartels; Dina Demner-Fushman

arXiv:2603.09986·cs.CL·May 8, 2026

TL;DR

This study quantifies hallucination rates in medical question-answering by large language models, revealing significant inaccuracies and emphasizing the need for human oversight in clinical applications.

Contribution

It provides the first systematic measurement of hallucination prevalence in textbook-grounded medical QA across multiple models and evaluates clinician preferences and agreement.

Findings

01. LLaMA-70B-Instruct hallucinated in 19.7% of answers with provided passages.

02. Higher usefulness scores correlated with lower hallucination rates ($\rho=-0.71$).

03. Clinicians showed high agreement in evaluating model responses.
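
To make the correlation in the second finding concrete, here is a minimal sketch of how a rank correlation between usefulness scores and hallucination rates could be computed. It assumes that $\rho$ denotes Spearman's rank correlation, which the summary does not state. The function names and the five data points are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model values (NOT from the paper).
usefulness = [4.5, 4.1, 3.8, 3.2, 2.9]
hallucination_rate = [0.05, 0.12, 0.10, 0.25, 0.30]

rho = spearman_rho(usefulness, hallucination_rate)
print(f"rho = {rho:.2f}")
```

A negative value, as in the reported $\rho=-0.71$, means that items ranked higher on usefulness tend to rank lower on hallucination rate. Because the statistic works on ranks, it captures monotonic association and does not require the relationship to be linear.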

Abstract

Hallucinations, the tendency of large language models to provide responses containing factually incorrect and unsupported claims, are a serious problem in natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first determines the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) in medical QA given closed-source zero-shot prompts, and the second determines the prevalence of hallucinations and clinician preference for model responses. In experiment one, with the passages provided, we observed that LLaMA-70B-Instruct hallucinated…
