Clinical knowledge in LLMs does not translate to human interactions

Andrew M. Bean; Rebecca Payne; Guy Parsons; Hannah Rose Kirk; Juan Ciro; Rafael Mosquera; Sara Hincapié Monsalve; Aruna S. Ekanayaka; Lionel Tarassenko; Luc Rocher; Adam Mahdi

arXiv:2504.18919 · cs.HC · April 29, 2025 · 3 citations

Open Access · 1 Repo · 1 Dataset

TL;DR

This study shows that although LLMs perform well on medical exams, they do not reliably help members of the public who consult them for medical advice, highlighting the need for human-centered testing before deployment.

Contribution

The paper demonstrates that LLMs' strong medical knowledge does not translate into effective assistance for real users in healthcare scenarios, emphasizing the importance of testing with human participants.

Findings

01

LLMs correctly identify conditions in 94.9% of cases when tested alone.

02

Participants using LLMs identified relevant conditions in less than 34.5% of cases.

03

Participants using LLMs chose the correct disposition in less than 44.2% of cases, no better than the control group.
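
To make the size of the gap concrete, the short sketch below computes the difference in percentage points between standalone LLM accuracy and the upper bounds reported for participants using LLMs. Only the four percentages come from this page (94.9%, 34.5%, and 44.2% from the findings, 56.3% from the abstract); the dictionary and function names are illustrative assumptions and are not taken from the paper or its repository.

```python
# Figures quoted on this page (percent of cases answered correctly).
LLM_ALONE = {"condition": 94.9, "disposition": 56.3}

# Reported only as upper bounds ("less than") for participants assisted by an LLM.
HUMAN_WITH_LLM_UPPER = {"condition": 34.5, "disposition": 44.2}


def min_gap_points(task: str) -> float:
    """Lower bound on the accuracy gap for a task, in percentage points.

    Because the participant figures are upper bounds, the true gap is
    strictly larger than the value returned here.
    """
    return round(LLM_ALONE[task] - HUMAN_WITH_LLM_UPPER[task], 1)


for task in LLM_ALONE:
    print(f"{task}: gap of more than {min_gap_points(task)} percentage points")
```

Under these figures, the gap for condition identification is several times larger than the gap for disposition, although the page does not say whether either difference was tested for significance against the standalone results.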

Abstract

Global healthcare providers are exploring use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested if LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in less than 34.5% of cases and disposition in less than 44.2%,…

Code & Models

Repositories

am-bean/HELPMed
PyTorch · Official

Datasets

ambean/HELPMed
dataset · 39 downloads

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Global Health and Surgery · Explainable Artificial Intelligence (XAI)

Methods: LLaMA