How well do LLMs cite relevant medical references? An evaluation framework and analyses
Kevin Wu, Eric Wu, Ally Cassasola, Angela Zhang, Kevin Wei, Teresa Nguyen, Sith Riantawan, Patricia Shi Riantawan, Daniel E. Ho, James Zou

TL;DR
This paper evaluates how accurately large language models cite relevant medical references. It develops an automated pipeline, validated against medical experts, to assess whether cited sources support the generated claims, and it finds significant gaps in source reliability.
Contribution
It introduces SourceCheckup, an automated evaluation pipeline, and demonstrates that many LLM responses are not fully supported by their cited sources in medical contexts.
Findings
88% agreement between GPT-4 and medical experts on source relevance
50-90% of LLM responses lack full source support
Approximately 30% of GPT-4 statements are unsupported even with retrieval augmentation
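The statement-level and response-level figures above can be illustrated with a minimal sketch. It assumes (this is an illustrative reading, not the authors' implementation) that a statement counts as supported when at least one of its cited sources supports it, and that a response counts as fully supported only when every one of its statements is supported. All function names and the input layout are hypothetical.

```python
from typing import List

# Hypothetical layout: a response is a list of statements, and each statement
# is a list of booleans, one per cited source (True = source supports it).
Statement = List[bool]
Response = List[Statement]


def statement_supported(source_verdicts: Statement) -> bool:
    """Assumption: a statement is supported if any cited source supports it."""
    return any(source_verdicts)


def response_fully_supported(response: Response) -> bool:
    """Assumption: a response is fully supported only if it has at least one
    statement and every statement is supported."""
    return bool(response) and all(statement_supported(s) for s in response)


def fraction_not_fully_supported(responses: List[Response]) -> float:
    """Share of responses with at least one unsupported statement."""
    if not responses:
        raise ValueError("need at least one response")
    lacking = sum(1 for r in responses if not response_fully_supported(r))
    return lacking / len(responses)


def percent_agreement(labels_a: List[bool], labels_b: List[bool]) -> float:
    """Simple percent agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


# Toy data, invented for illustration only (not results from the paper).
toy_responses: List[Response] = [
    [[True], [False, True]],   # both statements supported
    [[True], [False]],         # second statement unsupported
    [[False]],                 # only statement unsupported
    [[True, True]],            # supported
]
toy_rate = fraction_not_fully_supported(toy_responses)
toy_agreement = percent_agreement(
    [True, True, False, False], [True, False, False, False]
)
```

Under these assumptions a single unsupported statement is enough to mark a whole response as not fully supported, which is one reason response-level rates can be much higher than statement-level ones.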
Abstract
Large language models (LLMs) are currently being used to answer medical questions across a variety of clinical domains. Recent top-performing commercial LLMs, in particular, are also capable of citing sources to support their responses. In this paper, we ask: do the sources that LLMs generate actually support the claims that they make? To answer this, we propose three contributions. First, as expert medical annotations are an expensive and time-consuming bottleneck for scalable evaluation, we demonstrate that GPT-4 is highly accurate in validating source relevance, agreeing 88% of the time with a panel of medical doctors. Second, we develop an end-to-end, automated pipeline called SourceCheckup and use it to evaluate five top-performing LLMs on a dataset of 1200 generated questions, totaling over 40K pairs of statements and sources. Interestingly, we find that between ~50% to…
Taxonomy
Topics: Linguistics and Terminology Studies · Library Science and Information Systems · Biomedical Text Mining and Ontologies
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Adam · Residual Connection · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer
