MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation

Ernest Lim; Yajie Vera He; Jared Joselowitz; Kate Preston; Mohita Chowdhury; Louis Williams; Aisling Higham; Katrina Mason; Mariane Melo; Tom Lawton; Yan Jia; Ibrahim Habli

arXiv:2508.19163·cs.AI·August 27, 2025

TL;DR

MATRIX is a framework for evaluating the safety of clinical dialogue agents. It combines structured clinical scenarios, an LLM-based failure detector (BehvJudge), and a simulated patient agent (PatBot).

Contribution

This paper introduces MATRIX, a novel, extensible framework combining safety engineering, LLM-based failure detection, and realistic patient simulation for clinical dialogue evaluation.

Findings

01. BehvJudge detects safety failures with high accuracy (F1 0.96)

02. PatBot reliably simulates realistic patient behavior

03. MATRIX enables systematic benchmarking of clinical LLMs across multiple scenarios
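
For readers unfamiliar with the F1 metric quoted for BehvJudge, the following is a minimal sketch of how an F1 score could be computed when a judge's binary "failure detected" labels are compared against expert annotations. The function name, variable names, and label values are hypothetical and purely illustrative; they are not taken from the paper, and the paper's own evaluation protocol may differ.

```python
from typing import Sequence


def f1_score(expert: Sequence[bool], judge: Sequence[bool]) -> float:
    """F1 of judge labels against expert labels (True = safety failure present)."""
    if len(expert) != len(judge):
        raise ValueError("expert and judge label lists must have equal length")

    # Count agreement and disagreement on the positive ("failure") class.
    tp = sum(1 for e, j in zip(expert, judge) if e and j)
    fp = sum(1 for e, j in zip(expert, judge) if not e and j)
    fn = sum(1 for e, j in zip(expert, judge) if e and not j)

    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels for six dialogues (illustrative only, not the paper's data).
expert_labels = [True, True, False, True, False, False]
judge_labels = [True, True, False, False, False, True]

score = f1_score(expert_labels, judge_labels)
print(f"F1 = {score:.3f}")
```

In this toy example the judge misses one failure and raises one false alarm, so precision and recall are both 2/3 and the F1 score equals 2/3 as well.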

Abstract

Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse,…
