TL;DR
This study evaluates clinical language models on empathy, readability, and alignment with physician communication. The models work well as communication tools but do not surpass physicians in accuracy.
Contribution
It provides a comprehensive multidimensional assessment of clinical LLMs, highlighting the impact of prompting and rewriting on alignment and readability.
Findings
Empathy-oriented prompts reduce extreme negativity and grade-level complexity but do not significantly improve semantic fidelity.
Rewriting improves semantic similarity and readability and reduces affective extremity.
Patients prefer model responses for clarity and emotional tone, but the models do not outperform physicians in epistemic accuracy.
Abstract
Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting…
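The readability figures in the abstract are Flesch-Kincaid Grade Level (FKGL) scores. As a rough illustration of what the metric measures, the sketch below applies the standard FKGL formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The function names `count_syllables` and `fkgl` and the vowel-group syllable heuristic are assumptions made for this example. The abstract does not say which implementation the authors used, so scores from this sketch will not necessarily match the reported values.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups.

    This heuristic is an assumption for illustration; production
    readability tools typically use pronunciation dictionaries.
    """
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("made"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level using the standard coefficients."""
    # Naive sentence split on terminal punctuation (also an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    plain = "Take one pill each day. Call us if you feel sick."
    dense = ("Administration of the anticoagulant medication "
             "necessitates periodic laboratory monitoring.")
    print(f"plain: {fkgl(plain):.2f}")
    print(f"dense: {fkgl(dense):.2f}")
```

Longer sentences and more syllables per word both raise the score, which is why a rewrite that shortens sentences and swaps in plainer vocabulary can lower FKGL by several grade levels.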
