How Much Would a Clinician Edit This Draft? Evaluating LLM Alignment for Patient Message Response Drafting

Parker Seegmiller; Joseph Gatto; Sarah E. Greer; Ganza Belise Isingizwe; Rohan Ray; Timothy E. Burdick; Sarah Masud Preum

arXiv:2601.11344·cs.CL·January 19, 2026

TL;DR

This study evaluates how well large language models can generate patient message responses aligned with individual clinicians' preferences, highlighting the challenges and potential strategies for improving their integration into clinical workflows.

Contribution

It introduces a new taxonomy and evaluation framework for assessing LLMs in clinician response drafting, along with an expert-annotated dataset and comprehensive evaluation of adaptation techniques.

Findings

1. LLMs show variable ability to generate clinician-aligned responses across themes.
2. Theme-driven adaptation improves LLM performance in most response themes.
3. Significant epistemic uncertainty exists in aligning LLM outputs with clinician preferences.

Abstract

Large language models (LLMs) show promise in drafting responses to patient portal messages, yet their integration into clinical workflows raises various concerns, including whether they would actually save clinicians time and effort in their portal workload. We investigate LLM alignment with individual clinicians through a comprehensive evaluation of the patient message response drafting task. We develop a novel taxonomy of thematic elements in clinician responses and propose a novel evaluation framework for assessing clinician editing load of LLM-drafted responses at both content and theme levels. We release an expert-annotated dataset and conduct large-scale evaluations of local and commercial LLMs using various adaptation techniques including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Our results reveal substantial…

Code & Models

Datasets

PortalPal-AI/Patient-Message-Response-Drafting
dataset · 646 dl

Taxonomy

Topics: Machine Learning in Healthcare · Electronic Health Records Systems · Artificial Intelligence in Healthcare and Education