EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal

Samah Fodeh; Yan Wang; Linhai Ma; Srivani Talakokkul; Jordan M. Alpert; Sarah Schellhorn

arXiv:2603.00028·cs.CL·March 3, 2026

Open Access

TL;DR

EPPCMinerBen is a new benchmark dataset designed to evaluate large language models on their ability to analyze and extract insights from electronic patient-provider communication messages, focusing on classification and evidence extraction tasks.

Contribution

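This paper introduces EPPCMinerBen, a benchmark of expert-annotated patient portal messages with three evaluation tasks (Code Classification, Subcode Classification, and Evidence Extraction) for assessing LLMs in healthcare communication analysis, and reports how model performance differs across tasks and prompting settings.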

Findings

01. Llama-3.1-70B achieved the highest evidence extraction F1 score (82.84%).

02. Instruction-tuned models outperform smaller models in most tasks.

03. Few-shot prompting enhances model performance across tasks.

Abstract

Effective communication in health care is critical for treatment outcomes and adherence. With patient-provider exchanges shifting to secure messaging, analyzing electronic patient-provider communication (EPPC) data is both essential and challenging. We introduce EPPCMinerBen, a benchmark for evaluating LLMs in detecting communication patterns and extracting insights from electronic patient-provider messages. EPPCMinerBen includes three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Using 1,933 expert-annotated sentences from 752 secure messages from the patient portal at Yale New Haven Hospital, it evaluates LLMs on identifying communicative intent and supportive text. Benchmarks span various LLMs under zero-shot and few-shot settings, with data to be released via the NCI Cancer Data Service. Model performance varied across tasks and settings. Llama-3.1-70B led in…
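The abstract reports F1 for Evidence Extraction but, as excerpted here, does not say how predicted and gold evidence spans are matched. As a rough illustration only, the sketch below computes a token-overlap F1 between one predicted span and one gold span. The function name `token_f1`, the whitespace tokenization, and the example sentence are all assumptions made for this sketch, not the benchmark's actual scoring procedure.

```python
from collections import Counter


def token_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold evidence span.

    Hypothetical helper: lowercases and splits on whitespace, which may
    differ from the matching rule used by the benchmark authors.
    """
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()

    # If either span is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up example spans, not taken from the dataset.
    score = token_f1("refill my prescription", "please refill my prescription")
    print(f"token-level F1: {score:.4f}")
```

Here the predicted span recovers three of the four gold tokens with no extra tokens, so precision is 1.0 and recall is 0.75. An exact-match or span-level criterion would score the same pair differently, so the choice of matching rule matters when comparing reported numbers.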

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Health Literacy and Information Accessibility