Real World Human-LLM Interactions – Prospective blinded versus unblinded expert physician assessments of LLM responses to complex medical dilemmas

Itamar Ben Shitrit; Daphna Idan; Mark Volevich; Hadar Sharabi Goldenberg; Dolev Vaknin; Or Degany; Nitzan Abelson; Yair Binyamin; Raouf Nassar; Majd Nassar; Aviya Kedmi; Alexander Zlotnik; Sharon Einav

PMC · DOI: 10.1371/journal.pdig.0001278 · March 12, 2026

Open Access

TL;DR

This study explores how physicians rate responses from large language models (LLMs) in real clinical scenarios, finding that physician satisfaction does not reliably reflect the quality of LLM-generated medical content.

Contribution

The study introduces a novel approach to evaluating LLMs in healthcare: physicians rate LLM-generated and human-generated responses in a blinded setting, so the two can be compared without knowledge of the source.

Findings

01. Physician satisfaction scores were similar for LLM-generated and human-generated responses in a blinded evaluation.

02. Satisfaction did not correlate with the accuracy of the literature cited in the responses.

03. Physicians' resistance to change did not affect their ratings of LLM responses.

Abstract

Current evaluations of large language models (LLMs) in healthcare have largely emphasized theoretical benchmarks and clinician oversight, with limited exploration of real-world physician-AI interaction. In this two-stage prospective study, we assessed physician satisfaction with LLM-generated responses to real clinical queries. This study did not evaluate clinical accuracy, patient outcomes, or patient safety. In the first, unblinded stage, physicians used three models to address 25 clinical dilemmas and rated the quality of the responses: a general-purpose model (GPT-4o), a reasoning-focused model (GPT-o1), and a healthcare-specific model (OpenEvidence). In the second, blinded stage, the same physicians evaluated responses generated either by an LLM or by a human alone, without knowledge of the source. Across 100 real-world medical responses, median physician scores on a 5-point…

Linked entities

Species: Homo sapiens (human)

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Electronic Health Records Systems