Quantifying Hallucinations in Language Models on Medical Textbooks
Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

TL;DR
This study quantifies hallucination rates in medical question-answering by large language models, revealing significant inaccuracies and emphasizing the need for human oversight in clinical applications.
Contribution
It provides the first systematic measurement of hallucination prevalence in textbook-grounded medical QA across multiple models and evaluates clinician preferences and agreement.
Findings
LLaMA-70B-Instruct hallucinated in 19.7% of answers with provided passages.
Higher usefulness scores correlated with lower hallucination rates ($\rho=-0.71$).
Clinicians showed high agreement in evaluating model responses.
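As an illustration of how a correlation such as $\rho=-0.71$ might be computed, here is a minimal sketch that assumes $\rho$ denotes Spearman's rank correlation (the summary does not say which coefficient was used). The function names and the input values are hypothetical and made up for illustration; they are not data or code from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Made-up illustrative values, NOT results from the paper.
usefulness = [4.5, 3.0, 4.0, 2.0, 3.5]
hallucination_rate = [0.05, 0.30, 0.10, 0.25, 0.20]

rho = spearman_rho(usefulness, hallucination_rate)
print(f"Spearman rho = {rho:.2f}")
```

A negative value means that items ranked higher on usefulness tend to rank lower on hallucination rate, which is the direction of the reported finding.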
Abstract
Hallucinations, the tendency for large language models to provide responses with factually incorrect and unsupported claims, are a serious problem within natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first to determine the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) in medical QA given closed-source zero-shot prompts, and the second to determine the prevalence of hallucinations and clinician preferences among model responses. In experiment one, with the passages provided, we observed that LLaMA-70B-Instruct hallucinated…
