Reply: “Continuing the Chat: How Can We Improve the Performance of an Artificial Intelligence Chatbot in Answering Clinical Infectious Diseases Pharmacotherapy Questions?”

Wesley D Kufel; Conan MacDougall; Elizabeth W Covington; Jason C Gallagher; Robert W Seabury; Jeffrey M Steele

PMC · DOI: 10.1093/ofid/ofaf074 · February 6, 2025

Open Access

To the Editor: We thank Koh and colleagues for their correspondence and commend them for evaluating responses to the 100 clinical infectious diseases (ID) pharmacotherapy questions that we developed, using the same rating scales and definitions [1]. A comparison of the ratings between our study and Koh and colleagues' evaluation can be found in Table 1 [1, 2]. The median correctness and safety ratings were 7 and 8, respectively, in both evaluations. However, Koh and colleagues found a higher median completeness rating of 6, as compared with 5 in our study. The most notable finding was the difference in the proportion of responses deemed useful: they rated 93% of the GPT-4o responses as useful, as opposed to 41.8% in our study.

The authors used multiple techniques to address some of the shortcomings described in our study and to improve ChatGPT performance and response quality [1]. Unfortunately, GPT-4o was not available at the time of our study; this version has demonstrated improved performance in biomedical knowledge domains and fewer hallucinations as compared with GPT-3.5 [3–5]. Furthermore, ChatGPT versions continue to evolve and be updated at a rapid pace, which makes timely evaluation of each version logistically challenging [2]. Currently, the o1 model has been shown to perform even better owing to its ability to reason through problems [6]. The authors' use of detailed prompts helped improve response quality, and such prompts should be applied when clinicians pose clinical questions to ChatGPT; however, it is unclear whether most clinicians are aware of detailed prompt functionality and know how to use it effectively in routine practice.

The number of responses deemed useful by Koh and colleagues was higher than in our study. Of note, their responses were evaluated by 2 investigators who were ID physicians, as opposed to 5 ID pharmacists in our study. Evaluation standards likely differ somewhat among health care professionals, even when the same definitions are used. The questions were clinical ID pharmacotherapy questions of the kind that ID pharmacists receive and are perhaps best equipped to answer on the basis of their education, training, and clinical experience. ID physicians may therefore bring some degree of rater bias to these ratings, judging responses as useful when the GPT-4o responses do in fact have potential limitations. Conversely, ID pharmacists would likely be much more satisfied than ID physicians with ChatGPT responses to ID diagnostic questions.

While we do not have the specific rating scores for each response, we did evaluate some of the GPT-4o responses for accuracy to inform our interpretation of the useful ratings. The response to question 83 suggests using an alternative antibiotic to cefazolin for surgical site prophylaxis in a patient with anaphylaxis to penicillin, even though cefazolin can be used in this scenario [7]; therefore, we would not rate this response as useful. Furthermore, some responses still contain hallucinations. For example, the response to question 99 states that ampicillin susceptibility does not imply meropenem susceptibility in Enterococcus faecalis, which is appropriate and useful; however, the response further states that enterococci are inherently resistant to carbapenems due to a lack of target affinity, which is incorrect and not useful, since imipenem-cilastatin does indeed have activity against E faecalis, with ampicillin serving as a surrogate for susceptibility [8]. There are also some issues with the sources cited in the responses. For example, the response to question 27 cited a source of “Infectious Diseases Society of America Epidural Abscess Treatment Recommendations,” yet no such guidelines currently exist. While these are only a few examples that raise concern, they suggest that the useful rating of 93% may be overstated.

The rapidly changing nature of the artificial intelligence field and the difficulty of keeping pace with it through the peer-reviewed publication enterprise make evaluations of artificial intelligence in clinical practice challenging. A future opportunity could include presenting models with more real-world ID cases that focus on balancing risks and benefits, to see how ChatGPT compares with human clinical judgment and priorities. Nonetheless, we agree with Koh and colleagues that while GPT-4o improved on some of the shortcomings identified with GPT-3.5, ChatGPT still cannot replace a human health care professional's judgment and management in clinical practice.

Bibliography

1. Koh MCY, Smitasin N, Tambyah PA, Ngiam JN. Continuing the chat: how can we improve the performance of an artificial intelligence chatbot in answering clinical infectious diseases pharmacotherapy questions? Open Forum Infect Dis. Published online 2025. doi:10.1093/ofid/ofaf073. PMCID: PMC11845597. PMID: 39990632.
2. Kufel WD, Hanrahan KD, Seabury RW, et al. Let's have a chat: how well does an artificial intelligence chatbot answer clinical infectious diseases pharmacotherapy questions? Open Forum Infect Dis 2024; 11:ofae641. doi:10.1093/ofid/ofae641. PMCID: PMC11551448. PMID: 39529938.
3. Katz U, Cohen E, Shachar E, et al. GPT versus resident physicians—a benchmark based on official board scores. NEJM AI 2024; 1. doi:10.1056/AIdbp2300192.
4. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv. Posted online 2023. Available at: https://arxiv.org/abs/2303.13375.
5. Brin D, Sorin V, Vaid A, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep 2023; 13:16492. doi:10.1038/s41598-023-43436-9. PMCID: PMC10543445. PMID: 37779171.
6. Brodeur PG, Buckley TA, Kanjee Z, et al. Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv. Posted online 2024. Available at: https://www.arxiv.org/abs/2412.10849.
7. Sousa-Pinto B, Blumenthal KG, Courtney L, Mancini CM, Jeffres MN. Assessment of the frequency of dual allergy to penicillins and cefazolin: a systematic review and meta-analysis. JAMA Surg 2021; 156:e210021. doi:10.1001/jamasurg.2021.0021. PMCID: PMC7970387. PMID: 33729459.
8. Clinical Laboratory Standards Institute. Performance standards for antimicrobial susceptibility testing. 34th ed. Berwyn, PA: Clinical Laboratory Standards Institute, 2024. CLSI supplement M100.