Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases
Bonam Mingole, Aditya Majumdar, Firdaus Ahmed Choudhury, Jennifer L. Kraschnewski, Shyam S. Sundar, Amulya Yadav

TL;DR
This study evaluates the real-world effectiveness of large language models in answering everyday health queries through a crowdsourced approach, revealing that 76% of responses are deemed accurate by medical professionals.
Contribution
It introduces a novel crowdsourced evaluation method for LLMs on real-world health questions and assesses the impact of medical knowledge bases on response quality.
Findings
76% of LLM responses rated accurate by physicians
RAG models with medical knowledge bases can improve response quality
Qualitative insights from medical professionals explain quantitative results
Abstract
The proliferation of Large Language Models (LLMs) in high-stakes applications such as medical (self-)diagnosis and preliminary triage raises significant ethical and practical concerns about the effectiveness, appropriateness, and possible harmfulness of the use of these technologies for health-related concerns and queries. Some prior work has considered the effectiveness of LLMs in answering expert-written health queries/prompts, questions from medical examination banks, or queries based on pre-existing clinical cases. Unfortunately, these existing studies completely ignore an in-the-wild evaluation of the effectiveness of LLMs in answering everyday health concerns and queries typically asked by general users, which corresponds to the more prevalent use case for LLMs. To address this research gap, this paper presents the findings from a university-level competition that leveraged a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Healthcare and Education
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Dropout · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · BERT · BART
