MedPI: Evaluating AI Systems in Medical Patient-facing Interactions
Diego Fajardo V., Oleksii Proniakin, Victoria-Elisabeth Gruber, Razvan Marinescu

TL;DR
MedPI introduces a comprehensive, multi-dimensional benchmark for evaluating large language models in medical patient-clinician dialogues, assessing safety, communication, and diagnostic capabilities across 105 criteria.
Contribution
This work presents MedPI, a novel high-dimensional benchmark with a detailed evaluation framework and calibrated AI judges for assessing LLMs in complex medical conversations.
Findings
All evaluated LLMs show low performance on diagnostic and communication dimensions.
The benchmark reveals significant gaps in current models' medical understanding.
MedPI can guide future development of safer, more effective medical AI systems.
Abstract
We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the full medical dialogue on 105 dimensions covering the medical process, treatment safety, treatment outcomes, and doctor-patient communication, using a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g., anxiety, pregnancy, wellness checkup) × encounter objectives (e.g., diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1-4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges that are…
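As a rough illustration of layers (3) and (4), the sketch below builds a task matrix as the Cartesian product of encounter reasons and encounter objectives, and checks that per-dimension scores stay on the 1-4 scale. It is not the authors' implementation. The function names, the dimension names in the test data, and the use of a plain mean to summarize scores are all assumptions made for this example. Only the example reasons and objectives and the 1-4 scale come from the abstract.

```python
from itertools import product
from statistics import mean

# Example values quoted in the abstract; the real matrix is presumably larger.
ENCOUNTER_REASONS = ["anxiety", "pregnancy", "wellness checkup"]
ENCOUNTER_OBJECTIVES = ["diagnosis", "lifestyle advice", "medication advice"]


def build_task_matrix(reasons, objectives):
    """Return every (reason, objective) pair as one conversation task."""
    return list(product(reasons, objectives))


def validate_scores(scores):
    """Check that each dimension score is an integer on the 1-4 scale."""
    for dimension, score in scores.items():
        if not isinstance(score, int) or not 1 <= score <= 4:
            raise ValueError(f"{dimension}: score {score!r} is outside 1-4")
    return scores


def mean_score(scores):
    """Summarize one conversation with a plain mean.

    Assumption: the abstract does not say how scores are aggregated.
    """
    return mean(list(validate_scores(scores).values()))


if __name__ == "__main__":
    tasks = build_task_matrix(ENCOUNTER_REASONS, ENCOUNTER_OBJECTIVES)
    print(len(tasks), "tasks, first:", tasks[0])
    # Hypothetical dimension names, used only to exercise the helper.
    print(mean_score({"dimension_a": 2, "dimension_b": 4}))
```

Keeping the matrix as explicit pairs makes it easy to filter out combinations that are not clinically meaningful before conversations are generated.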
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling
