Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

Md Tanvir Rouf Shawon; Mohammad Sabik Irbaz; Hadeel R. A. Elyazori; Keerti Reddy Resapu; Yili Lin; Vladimir Franzuela Cardenas; Farrokh Alemi; Kevin Lybarger

arXiv:2602.11391 · cs.CL · March 30, 2026

TL;DR

This paper presents a patient simulator for evaluating healthcare conversational agents, revealing performance risks related to health literacy and demonstrating high fidelity in medical concept simulation.

Contribution

It introduces a novel, multi-profile patient simulator grounded in the NIST AI Risk Management Framework for systematic risk assessment of conversational healthcare AI.

Findings

01 Performance degrades with lower health literacy levels

02 High medical concept fidelity validated by human and LLM judges

03 Behavioral profiles are reliably distinguished with high agreement

Abstract

Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents, generating realistic, controllable interactions that systematically vary across medical, linguistic, and behavioral dimensions to support risk assessment across populations. Methods: Grounded in the NIST AI Risk Management Framework, the simulator integrates three profile components: (1) medical profiles constructed from All of Us electronic health records using risk-ratio gating; (2) linguistic profiles modeling health literacy and condition-specific communication; and (3) behavioral profiles representing cooperative, distracted, and adversarial engagement. Profiles were evaluated against NIST AI RMF trustworthiness requirements and assessed against an AI Decision Aid for antidepressant selection. Results: Across 500 simulated conversations, the simulator…
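The three profile components described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "risk-ratio gating" means keeping a condition in the medical profile only when its risk ratio (risk among the exposed group divided by risk among the unexposed group) meets a threshold. The threshold value, the literacy level names, the condition names, and all function and class names below are hypothetical. Only the three behavioral profile names come from the abstract.

```python
from dataclasses import dataclass
from itertools import product

# Behavioral profiles named in the abstract.
BEHAVIORS = ("cooperative", "distracted", "adversarial")
# Hypothetical literacy levels; the abstract does not enumerate them.
LITERACY_LEVELS = ("low", "high")


def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group.

    Raises ZeroDivisionError if a group is empty or the unexposed group
    has no cases; a real pipeline would need to handle those cases.
    """
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


def gate_conditions(counts, threshold=2.0):
    """Keep conditions whose risk ratio is at least `threshold`.

    `counts` maps a condition name to a 4-tuple:
    (exposed_cases, exposed_total, unexposed_cases, unexposed_total).
    The default threshold is an assumption chosen for illustration.
    """
    return [
        name
        for name, (a, n1, c, n0) in counts.items()
        if risk_ratio(a, n1, c, n0) >= threshold
    ]


@dataclass(frozen=True)
class PatientProfile:
    """One simulated patient: medical, linguistic, and behavioral parts."""
    conditions: tuple
    literacy: str
    behavior: str


def build_profiles(conditions):
    """Cross one medical profile with every literacy/behavior combination."""
    return [
        PatientProfile(tuple(conditions), literacy, behavior)
        for literacy, behavior in product(LITERACY_LEVELS, BEHAVIORS)
    ]


# Toy counts, invented for illustration only (not from the paper).
toy_counts = {
    "insomnia": (30, 100, 10, 100),  # risk ratio of about 3, kept
    "fracture": (5, 100, 5, 100),    # risk ratio of 1, dropped
}
gated = gate_conditions(toy_counts)
profiles = build_profiles(gated)
```

Crossing one gated medical profile with each literacy and behavior setting is one simple way to vary the linguistic and behavioral dimensions while holding the medical content fixed, which is what makes per-dimension comparisons of agent performance possible.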
