Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases

Bonam Mingole; Aditya Majumdar; Firdaus Ahmed Choudhury; Jennifer L. Kraschnewski; Shyam S. Sundar; Amulya Yadav

arXiv:2506.13805·cs.CY·June 18, 2025

Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases

Bonam Mingole, Aditya Majumdar, Firdaus Ahmed Choudhury, Jennifer L. Kraschnewski, Shyam S. Sundar, Amulya Yadav

PDF

Open Access

TL;DR

This study evaluates the real-world effectiveness of large language models in answering everyday health queries through a crowdsourced approach, revealing that 76% of responses are deemed accurate by medical professionals.

Contribution

It introduces a novel crowdsourced evaluation method for LLMs on real-world health questions and assesses the impact of medical knowledge bases on response quality.

Findings

01

76% of LLM responses rated accurate by physicians

02

RAG models with medical knowledge bases can improve response quality

03

Qualitative insights from medical professionals explain quantitative results

Abstract

The proliferation of Large Language Models (LLMs) in high-stakes applications such as medical (self-)diagnosis and preliminary triage raises significant ethical and practical concerns about the effectiveness, appropriateness, and possible harmfulness of the use of these technologies for health-related concerns and queries. Some prior work has considered the effectiveness of LLMs in answering expert-written health queries/prompts, questions from medical examination banks, or queries based on pre-existing clinical cases. Unfortunately, these existing studies completely ignore an in-the-wild evaluation of the effectiveness of LLMs in answering everyday health concerns and queries typically asked by general users, which corresponds to the more prevalent use case for LLMs. To address this research gap, this paper presents the findings from a university-level competition that leveraged a…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsArtificial Intelligence in Healthcare and Education

MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Dropout · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · BERT · BART