AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding

Gyutaek Oh; Sangjoon Park; Byung-Hoon Kim

arXiv:2512.10195·cs.CL·December 12, 2025

Open Access

TL;DR

AutoMedic is a framework that transforms static medical datasets into simulated clinical conversations to evaluate large language models' performance across multiple important clinical criteria.

Contribution

The paper introduces AutoMedic, a novel multi-agent simulation framework that enables automated, realistic, and multi-faceted evaluation of clinical conversational agents.

Findings

01. AutoMedic's CARE metric correlates well with human expert judgments.

02. AutoMedic effectively assesses LLMs' accuracy, empathy, and robustness in clinical dialogues.

03. Validated by human experts, AutoMedic provides reliable automated evaluation results.

Abstract

Evaluating large language models (LLMs) has recently emerged as a critical issue for the safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as how effectively LLMs generate responses in dynamic, interactive, multi-turn clinical conversations and which multi-faceted evaluation strategies can go beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by the vast combinatorial space of possible patient states and interaction trajectories, which makes such scenarios difficult to standardize and measure quantitatively. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare