MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation
Ernest Lim, Yajie Vera He, Jared Joselowitz, Kate Preston, Mohita Chowdhury, Louis Williams, Aisling Higham, Katrina Mason, Mariane Melo, Tom Lawton, Yan Jia, Ibrahim Habli

TL;DR
MATRIX is a comprehensive framework that evaluates clinical dialogue agents' safety and realism using structured scenarios, LLM-based failure detection, and simulated patient responses, advancing safety assessment in clinical AI systems.
Contribution
This paper introduces MATRIX, a novel, extensible framework combining safety engineering, LLM-based failure detection, and realistic patient simulation for clinical dialogue evaluation.
Findings
BehvJudge detects safety failures with an F1 score of 0.96
PatBot reliably simulates realistic patient behavior
MATRIX enables systematic benchmarking of clinical LLMs across multiple scenarios
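For readers unfamiliar with the metric, the F1 figure quoted above is the harmonic mean of precision and recall. The sketch below shows how such a score could be computed when an automated judge's verdicts are compared with reference annotations. The function name `f1_score` and the toy labels are hypothetical and chosen only for illustration; they are not the paper's data or its evaluation code.

```python
def f1_score(predicted, expected):
    """F1 for binary labels, where True means 'safety failure flagged'."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)        # flagged, and truly a failure
    fp = sum(1 for p, e in pairs if p and not e)    # flagged, but not a failure
    fn = sum(1 for p, e in pairs if e and not p)    # missed failure
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (assumption: not taken from the paper).
judge_verdicts = [True, True, False, True, False, False]
reference_labels = [True, True, False, False, True, False]

score = f1_score(judge_verdicts, reference_labels)
print(f"F1 = {score:.2f}")
```

With these toy labels there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is also 2/3.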
Abstract
Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse,…
