Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI

Anna Kozlova; Stanislau Salavei; Pavel Satalkin; Hanna Plotnitskaya; Sergey Parfenyuk

arXiv:2603.25821 · cs.CL · March 30, 2026

TL;DR

Doctorina MedBench is a comprehensive framework for evaluating agent-based medical AI through realistic clinical dialogue simulations, assessing diagnostic accuracy, efficiency, and safety in a variety of scenarios.

Contribution

The paper introduces a simulation-based evaluation framework with multi-level testing and a new metric, D.O.T.S. (Diagnosis, Observations/Investigations, Treatment, Step Count), for assessing medical AI performance in realistic physician-patient interactions.

Findings

01 Simulating physician-patient dialogue provides a more realistic assessment of clinical competence than standardized test questions.

02 The framework supports safety monitoring and regression testing.

03 The benchmark includes over 1,000 clinical cases with 750 diagnoses.

Abstract

We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect…
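The four D.O.T.S. components named in the abstract can be pictured as one record per simulated dialogue. The following is a minimal sketch, not the authors' implementation: the `DotsResult` class, the `summarize` helper, and the 0-to-1 scoring scale are all hypothetical, since the abstract does not state how each component is scored or whether the components are combined into one number. For that reason the sketch reports each component separately.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DotsResult:
    """Scores for one simulated dialogue (hypothetical container).

    The 0..1 scale for D, O and T is an assumption of this sketch,
    not something specified in the abstract.
    """
    diagnosis: float     # D: correctness of the diagnosis (assumed 0..1)
    observations: float  # O: observations/investigations (assumed 0..1)
    treatment: float     # T: treatment recommendations (assumed 0..1)
    steps: int           # S: number of dialogue steps taken

    def __post_init__(self):
        # Reject values outside the assumed ranges early.
        for name in ("diagnosis", "observations", "treatment"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        if self.steps < 1:
            raise ValueError("steps must be at least 1")


def summarize(results):
    """Average each component separately across cases.

    No combined score is computed, because the abstract gives no
    weighting between clinical correctness and step count.
    """
    if not results:
        raise ValueError("need at least one result")
    return {
        "diagnosis": mean(r.diagnosis for r in results),
        "observations": mean(r.observations for r in results),
        "treatment": mean(r.treatment for r in results),
        "steps": mean(r.steps for r in results),
    }


if __name__ == "__main__":
    # Made-up scores for two illustrative cases, not data from the paper.
    cases = [
        DotsResult(diagnosis=1.0, observations=0.5, treatment=1.0, steps=8),
        DotsResult(diagnosis=0.0, observations=0.5, treatment=0.5, steps=12),
    ]
    print(summarize(cases))
```

Keeping the step count apart from the three correctness scores mirrors the abstract's distinction between clinical correctness and dialogue efficiency. Any weighting between the two would be a further assumption.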
