# Evaluating a large language model's ability to answer clinicians' requests for evidence summaries

**Authors:** Mallory N. Blasingame, Taneya Y. Koonce, Annette M. Williams, Dario A. Giuse, Jing Su, Poppy A. Krump, Nunzia Bettinsoli Giuse

PMC · DOI: 10.5195/jmla.2025.1985 · Journal of the Medical Library Association : JMLA · 2025-01-14

## TL;DR

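This study compares answers to clinical questions from a GPT-4-based chat tool against medical librarians' gold-standard evidence summaries. Of the responses, 83.3% were rated correct, but only 37% of the references checked could be confirmed as nonfabricated.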

## Contribution

The study evaluates how well aiChat, an internally managed chat tool using GPT-4, generates clinical evidence summaries, measured against medical librarians' gold-standard syntheses. It also checks a random subset of aiChat's references for fabrication.

## Key findings

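- aiChat (GPT-4) was rated "correct" on 180 of 216 clinical questions (83.3%), "partially correct" on 35 (16.2%), and "incorrect" on 1 (0.5%).
- In a random subset of 66 questions, only 60 of the 162 references provided by aiChat (37%) were confirmed as nonfabricated.
- Ratings did not differ significantly across question categories (p=0.73).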

## Abstract

This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses.

Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat.
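The standardized prompt itself is not reproduced in this summary. As a rough illustration of the general shape of a COSTAR prompt (Context, Objective, Style, Tone, Audience, Response), the sketch below assembles the six sections around a question. The `CostarPrompt` class, the section delimiters, and all placeholder wording are hypothetical and are not the prompt the authors used.

```python
from dataclasses import dataclass


@dataclass
class CostarPrompt:
    """Hypothetical container for the six COSTAR sections."""
    context: str
    objective: str
    style: str
    tone: str
    audience: str
    response: str

    def render(self, question: str) -> str:
        # Emit the sections in COSTAR order, then append the question.
        sections = [
            ("CONTEXT", self.context),
            ("OBJECTIVE", self.objective),
            ("STYLE", self.style),
            ("TONE", self.tone),
            ("AUDIENCE", self.audience),
            ("RESPONSE", self.response),
        ]
        body = "\n\n".join(f"# {name} #\n{text}" for name, text in sections)
        return f"{body}\n\n# QUESTION #\n{question}"


# Placeholder wording only (an assumption, not the study's prompt).
example = CostarPrompt(
    context="A clinician has requested an evidence summary.",
    objective="Summarize the published evidence that answers the question.",
    style="Concise evidence synthesis.",
    tone="Neutral and professional.",
    audience="Practicing clinicians.",
    response="A short summary followed by a list of references.",
)
prompt_text = example.render("<clinical question goes here>")
```

Keeping the sections in one fixed template means every question is submitted with identical framing, so differences in the responses are less likely to come from prompt wording.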

Of the 216 evaluated questions, aiChat's response was assessed as “correct” for 180 (83.3%) questions, “partially correct” for 35 (16.2%) questions, and “incorrect” for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated.
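The reported percentages follow directly from the counts in the paragraph above. The short check below recomputes them; the helper name `pct` and the variable names are illustrative, and only the counts come from the abstract.

```python
def pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Rating counts reported in the abstract.
ratings = {"correct": 180, "partially correct": 35, "incorrect": 1}
total = sum(ratings.values())  # 216 evaluated questions

shares = {label: pct(n, total) for label, n in ratings.items()}
# shares == {"correct": 83.3, "partially correct": 16.2, "incorrect": 0.5}

# Reference verification in the randomly selected subset.
subset_share = pct(66, total)   # 30.6, reported as 30% in the abstract
verified_share = pct(60, 162)   # 37.0
```

Note that the 37% figure applies to the 162 references in the 66-question subset, not to all references across the 216 questions.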

Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.

## Full-text entities

- **Diseases:** SIADH (MESH:D007177), hyponatremia (MESH:D007010)
- **Chemicals:** water (MESH:D014867), NaCl (MESH:D012965)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11835037/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC11835037/full.md

## References

85 references — full list in the complete paper: https://tomesphere.com/paper/PMC11835037/full.md

---
Source: https://tomesphere.com/paper/PMC11835037