How well do LLMs cite relevant medical references? An evaluation framework and analyses

Kevin Wu; Eric Wu; Ally Cassasola; Angela Zhang; Kevin Wei; Teresa Nguyen; Sith Riantawan; Patricia Shi Riantawan; Daniel E. Ho; James Zou

arXiv:2402.02008·cs.CL·February 6, 2024·39 cites

TL;DR

This paper evaluates how accurately large language models cite relevant medical references, developing an automated pipeline and expert validation to assess the supportiveness of generated sources, revealing significant gaps in source reliability.

Contribution

It introduces SourceCheckup, an automated evaluation pipeline, and demonstrates that many LLM responses are not fully supported by their cited sources in medical contexts.

Findings

01. 88% agreement between GPT-4 and medical experts on source relevance

02. 50-90% of LLM responses lack full source support

03. Approximately 30% of GPT-4 statements are unsupported even with retrieval augmentation
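
As a rough illustration of the kinds of quantities reported above, the sketch below computes a simple percent-agreement score between two annotators and statement-level and response-level support rates from statement-source judgments. It is not the SourceCheckup implementation: the function names, the input layout, and the aggregation rule (a statement counts as supported if at least one cited source supports it; a response counts as fully supported only if all of its statements are) are assumptions made for this example.

```python
from collections import defaultdict


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def support_rates(pairs):
    """Return (statement_rate, response_rate).

    `pairs` is an iterable of (response_id, statement_id, supported) tuples,
    one per statement-source pair. This layout is hypothetical.
    """
    # Assumed rule: a statement is supported if ANY of its sources supports it.
    by_statement = defaultdict(bool)
    for response_id, statement_id, supported in pairs:
        by_statement[(response_id, statement_id)] |= bool(supported)
    if not by_statement:
        raise ValueError("no statement-source pairs given")

    # Assumed rule: a response is fully supported if ALL its statements are.
    by_response = defaultdict(list)
    for (response_id, _statement_id), ok in by_statement.items():
        by_response[response_id].append(ok)

    statement_rate = sum(by_statement.values()) / len(by_statement)
    response_rate = sum(all(v) for v in by_response.values()) / len(by_response)
    return statement_rate, response_rate


if __name__ == "__main__":
    # Toy judgments, invented for illustration only.
    toy_pairs = [
        ("r1", "s1", True),
        ("r1", "s1", False),
        ("r1", "s2", False),
        ("r2", "s1", True),
    ]
    print(percent_agreement([1, 0, 1, 1], [1, 0, 0, 1]))
    print(support_rates(toy_pairs))
```

Under these assumptions, one unsupported statement is enough to mark a whole response as not fully supported, so the response-level rate can never be higher than the statement-level rate.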

Abstract

Large language models (LLMs) are currently being used to answer medical questions across a variety of clinical domains. Recent top-performing commercial LLMs, in particular, are also capable of citing sources to support their responses. In this paper, we ask: do the sources that LLMs generate actually support the claims that they make? To answer this, we propose three contributions. First, as expert medical annotations are an expensive and time-consuming bottleneck for scalable evaluation, we demonstrate that GPT-4 is highly accurate in validating source relevance, agreeing 88% of the time with a panel of medical doctors. Second, we develop an end-to-end, automated pipeline called SourceCheckup and use it to evaluate five top-performing LLMs on a dataset of 1200 generated questions, totaling over 40K pairs of statements and sources. Interestingly, we find that between ~50% to…

Code & Models

Repositories

kevinwu23/SourceCheckup

Taxonomy

Topics

Linguistics and terminology studies · Library Science and Information Systems · Biomedical Text Mining and Ontologies

Methods

Attention Is All You Need · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Adam · Residual Connection · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer