MedRiskEval: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings

Jean-Philippe Corbeil; Minseon Kim; Maxime Griot; Sheela Agarwal; Alessandro Sordoni; Francois Beaulieu; Paul Vozila

arXiv:2507.07248 · cs.CL · January 12, 2026

TL;DR

MedRiskEval is a medical risk evaluation benchmark for language models that accounts for the perspectives of different users, from patients to clinicians, to support safer deployment in healthcare.

Contribution

The paper presents MedRiskEval, a novel risk evaluation benchmark including a patient-oriented dataset, addressing safety concerns for diverse healthcare user groups.

Findings

01 Evaluated multiple LLMs on the new benchmark
02 Identified safety risks across different user perspectives
03 Provided insights for safer medical AI deployment

Abstract

As the performance of large language models (LLMs) continues to advance, their adoption in the medical domain is increasing. However, most existing risk evaluations have focused on general safety benchmarks. In medical applications, LLMs may be used by a wide range of users with diverse levels of expertise, from general users and patients to clinicians, and the model's outputs can directly affect human health, which raises serious safety concerns. In this paper, we introduce MedRiskEval, a risk evaluation benchmark tailored to the medical domain. To fill the gap left by previous benchmarks that focused only on the clinician perspective, we introduce a new patient-oriented dataset, PatientSafetyBench, containing 466 samples across 5 critical risk categories. Leveraging our new benchmark alongside existing datasets, we evaluate a variety of open- and…

Code & Models

Datasets

microsoft/PatientSafetyBench
dataset· 90 dl

Videos

MedRiskEval: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings