MedPI: Evaluating AI Systems in Medical Patient-facing Interactions

Diego Fajardo V.; Oleksii Proniakin; Victoria-Elisabeth Gruber; Razvan Marinescu

arXiv:2601.04195 · cs.CL · January 9, 2026

TL;DR

MedPI introduces a comprehensive, multi-dimensional benchmark for evaluating large language models in medical patient-clinician dialogues, assessing safety, communication, and diagnostic capabilities across 105 criteria.

Contribution

This work presents MedPI, a novel high-dimensional benchmark with a detailed evaluation framework and calibrated AI judges for assessing LLMs in complex medical conversations.

Findings

01. All evaluated LLMs show low performance on diagnostic and communication dimensions.

02. The benchmark reveals significant gaps in current models' medical understanding.

03. MedPI can guide future development of safer, more effective medical AI systems.

Abstract

We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue along 105 dimensions covering the medical process, treatment safety, treatment outcomes, and doctor-patient communication, using a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g., anxiety, pregnancy, wellness checkup) × encounter objectives (e.g., diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1-4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges that are…
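As a rough illustration of layers (3) and (4), the Task Matrix can be read as a Cartesian product of encounter reasons and encounter objectives, and the Evaluation Framework as a set of 1-4 scores that roll up into competency groups. The sketch below is not the authors' implementation: the function names, the dimension names (`red_flag_screening`, `dose_check`, `empathy`), and the dimension-to-competency mapping are all hypothetical, and only the example reasons and objectives are taken from the abstract.

```python
from itertools import product
from statistics import mean

# Example values quoted in the abstract; the actual matrix is presumably larger.
ENCOUNTER_REASONS = ["anxiety", "pregnancy", "wellness checkup"]
ENCOUNTER_OBJECTIVES = ["diagnosis", "lifestyle advice", "medication advice"]


def build_task_matrix(reasons, objectives):
    """Return every (reason, objective) pair, i.e. the Cartesian product."""
    return list(product(reasons, objectives))


def aggregate_scores(scores, dimension_to_group):
    """Average 1-4 dimension scores within each group.

    `scores` maps dimension name -> integer score and `dimension_to_group`
    maps dimension name -> group label. Both layouts are assumptions made
    for this sketch, not the benchmark's real schema.
    """
    grouped = {}
    for dimension, score in scores.items():
        if score not in (1, 2, 3, 4):
            raise ValueError(f"{dimension}: score {score} is outside the 1-4 scale")
        grouped.setdefault(dimension_to_group[dimension], []).append(score)
    return {group: mean(values) for group, values in grouped.items()}


tasks = build_task_matrix(ENCOUNTER_REASONS, ENCOUNTER_OBJECTIVES)

# Hypothetical dimensions, scores, and competency assignments.
example_scores = {"red_flag_screening": 2, "dose_check": 3, "empathy": 4}
example_groups = {
    "red_flag_screening": "Patient Care",
    "dose_check": "Patient Care",
    "empathy": "Interpersonal and Communication Skills",
}
summary = aggregate_scores(example_scores, example_groups)

print(len(tasks))
print(summary)
```

With three reasons and three objectives the sketch yields nine task cells. Averaging within a group is only one possible roll-up; the source excerpt does not say how per-dimension scores are combined.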

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling