Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI
Anna Kozlova, Stanislau Salavei, Pavel Satalkin, Hanna Plotnitskaya, Sergey Parfenyuk

TL;DR
Doctorina MedBench is a comprehensive framework for evaluating agent-based medical AI through realistic clinical dialogue simulations, assessing diagnostic accuracy, efficiency, and safety in a variety of scenarios.
Contribution
The paper introduces a simulation-based evaluation framework with multi-level testing and a new metric, D.O.T.S. (Diagnosis, Observations/Investigations, Treatment, Step Count), for assessing medical AI performance in realistic interactions.
Findings
Simulating multi-step clinical dialogue gives a more realistic assessment of clinical competence than standardized test questions.
The framework supports safety monitoring and regression testing.
The benchmark includes over 1,000 clinical cases covering 750 diagnoses.
Abstract
We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect…
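To make the four D.O.T.S. components concrete, here is a minimal sketch of how one scored dialogue might be recorded and summarized. The abstract names the components but does not say how each is scored or whether they are combined, so everything here is an assumption for illustration: the `DotsResult` class, its pass/fail fields, and the `summarize` helper are hypothetical, and the components are reported separately rather than merged into a single number.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DotsResult:
    """One simulated dialogue scored on the four D.O.T.S. components.

    The pass/fail encoding is an assumption; the paper may grade differently.
    """
    diagnosis_correct: bool      # D: diagnosis judged correct
    observations_correct: bool   # O: observations/investigations judged appropriate
    treatment_correct: bool      # T: treatment recommendations judged appropriate
    step_count: int              # S: number of dialogue steps used


def summarize(results: list[DotsResult]) -> dict[str, float]:
    """Average each component separately; no combined score is assumed."""
    if not results:
        raise ValueError("need at least one scored dialogue")
    n = len(results)
    return {
        "diagnosis": sum(r.diagnosis_correct for r in results) / n,
        "observations": sum(r.observations_correct for r in results) / n,
        "treatment": sum(r.treatment_correct for r in results) / n,
        "mean_steps": sum(r.step_count for r in results) / n,
    }


if __name__ == "__main__":
    # Two made-up dialogues, purely illustrative (not data from the paper).
    demo = [
        DotsResult(True, True, False, 10),
        DotsResult(False, True, True, 14),
    ]
    print(summarize(demo))
```

Keeping step count as its own field reflects the abstract's distinction between clinical correctness (D, O, T) and dialogue efficiency (S).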
