Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications

Xiao Ye; Jacob Dineen; Zhaonan Li; Zhikun Xu; Weiyu Chen; Shijie Lu; Yuxi Huang; Ming Shen; Phu Tran; Ji-Eun Irene Yum; Muhammad Ali Khan; Muhammad Umar Afzal; Irbaz Bin Riaz; and Ben Zhou

arXiv:2510.17764 · cs.CL · October 21, 2025

Open Access

TL;DR

This survey reframes the evaluation of medical large language models around levels of autonomy (L0-L3), favoring risk-aware, application-oriented assessment over benchmark scores alone to support safe clinical deployment.

Contribution

The survey introduces a levels-of-autonomy framework (L0-L3) for evaluating medical LLMs, links existing benchmarks and metrics to the actions permitted at each level and their associated risks, and offers guidance for credible, risk-aware assessment in real-world use.

Findings

01. Aligns benchmarks with autonomy levels and associated risks

02. Proposes a level-conditioned blueprint for evaluation and reporting

03. Moves evaluation focus from scores to clinical safety and reliability

Abstract

Medical large language models (LLMs) achieve strong scores on standard benchmarks; however, transferring those results to safe and reliable performance in clinical workflows remains a challenge. This survey reframes evaluation through a levels-of-autonomy lens (L0-L3), spanning informational tools, information transformation and aggregation, decision support, and supervised agents. We align existing benchmarks and metrics with the actions permitted at each level and their associated risks, making the evaluation targets explicit. This motivates a level-conditioned blueprint for selecting metrics, assembling evidence, and reporting claims, alongside directions that link evaluation to oversight. By centering autonomy, the survey moves the field beyond score-based claims toward credible, risk-aware evidence for real clinical use.
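As a rough illustration of the four-level structure named in the abstract, the levels could be written as an ordered enumeration. This is a minimal sketch, not the authors' implementation: the names `AutonomyLevel`, `LEVEL_DESCRIPTIONS`, and `describe` are hypothetical, and pairing L0 through L3 with the four categories in the order they are listed is an assumption drawn from the abstract's wording.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Hypothetical encoding of the L0-L3 levels; higher value = more autonomy."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3


# Descriptions copied from the abstract. The level-to-description pairing
# follows the order of listing there and is an assumption.
LEVEL_DESCRIPTIONS = {
    AutonomyLevel.L0: "informational tools",
    AutonomyLevel.L1: "information transformation and aggregation",
    AutonomyLevel.L2: "decision support",
    AutonomyLevel.L3: "supervised agents",
}


def describe(level: AutonomyLevel) -> str:
    """Return a short label such as 'L2: decision support'."""
    return f"{level.name}: {LEVEL_DESCRIPTIONS[level]}"


if __name__ == "__main__":
    # Print the levels from least to most autonomous.
    for level in AutonomyLevel:
        print(describe(level))
```

Using an ordered enumeration keeps comparisons such as `AutonomyLevel.L3 > AutonomyLevel.L1` meaningful, which would suit a level-conditioned evaluation where requirements grow with autonomy. Which metrics belong at each level is left out here because the abstract does not specify them.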

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)