Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content

Lydie Bednarczyk; Jamil Zaghir; Julien Ehrsam; Maria Tcherepanova; Christian Skalafouris; Karim Gariani; Catherine Geslin; Claire-Bénédicte Rivara; Pascal Bonnabry; Laetitia Gosetto; Richard Dubos; Mina Bjelogrlic; Christophe Gaudet-Blavignac; Christian Lovis

arXiv:2605.04085 · cs.CY · May 7, 2026

TL;DR

This study develops and validates a Failure Mode, Effects, and Criticality Analysis (FMECA) framework for systematically assessing patient safety risks in clinical summaries generated by large language models, in support of proactive risk management.

Contribution

The study introduces the first FMECA-based method tailored to evaluating safety risks in LLM-generated clinical content and reports on its reliability and usability.

Findings

01

Inter-rater reliability improved with training and over successive rounds.

02

The framework identified 14 failure modes in clinical summaries.

03

Usability was rated as good, and evaluators reported high confidence.

Abstract

Objectives: Large language models (LLMs) are increasingly used for clinical text summarization, yet structured methods to assess associated patient safety risks remain limited. Failure Mode, Effects, and Criticality Analysis (FMECA) provides a proactive framework for systematic risk identification but has not been adapted to LLM-generated clinical content. This study aimed to develop and validate a novel FMECA framework for the prospective assessment of patient safety risks in LLM-generated clinical summaries.

Materials and Methods: An interdisciplinary expert panel (n = 8) developed a taxonomy of failure modes through literature review and brainstorming. Standard FMECA dimensions (occurrence, severity, detectability) were adapted into 5-point ordinal scales. The framework was applied to 36 discharge summaries from four patients, generated by an open LLM (GPT-OSS 120B) using…
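To make the three FMECA dimensions concrete, the following is a minimal Python sketch of how ratings on 5-point ordinal scales could be combined and ranked. It assumes the conventional FMECA criticality index (the product occurrence × severity × detectability, where a higher detectability score means the failure is harder to detect); the abstract as shown here does not state how the authors aggregate the three scores, so this may differ from their method. The class name, function name, failure-mode labels, and all numeric ratings are hypothetical and purely illustrative; they are not taken from the paper.

```python
from dataclasses import dataclass

# 5-point ordinal scales, as described in the abstract.
SCALE_MIN, SCALE_MAX = 1, 5


@dataclass(frozen=True)
class FailureModeRating:
    """One rating of one failure mode (hypothetical structure, not the paper's)."""
    failure_mode: str
    occurrence: int
    severity: int
    detectability: int

    def __post_init__(self):
        # Reject scores outside the 1..5 ordinal range.
        for name in ("occurrence", "severity", "detectability"):
            value = getattr(self, name)
            if not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(
                    f"{name} must be in {SCALE_MIN}..{SCALE_MAX}, got {value}"
                )

    @property
    def criticality(self) -> int:
        # Assumption: conventional product O x S x D (range 1..125).
        # The paper may define criticality differently.
        return self.occurrence * self.severity * self.detectability


def rank_by_criticality(ratings):
    """Return ratings sorted from highest to lowest criticality."""
    return sorted(ratings, key=lambda r: r.criticality, reverse=True)


# Illustrative, made-up ratings; labels and scores are NOT from the paper.
ratings = [
    FailureModeRating("example mode A", occurrence=4, severity=2, detectability=1),
    FailureModeRating("example mode B", occurrence=3, severity=4, detectability=4),
    FailureModeRating("example mode C", occurrence=2, severity=5, detectability=3),
]

for r in rank_by_criticality(ratings):
    print(f"{r.failure_mode}: criticality {r.criticality}")
```

Under these assumptions, a ranking of this kind would let reviewers direct attention to the failure modes that are frequent, harmful, and hard to catch at the same time, which is the usual purpose of a criticality index.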
