Benchmarking large language model-based agent systems for clinical decision tasks

Yunsong Liu; Zunamys I. Carrero; Xiaofeng Jiang; Dyke Ferber; Georg Wölflein; Li Zhang; Sanddhya Jayabalan; Tim Lenz; Zhouguang Hui; Jakob Nikolas Kather

PMC · DOI: 10.1038/s41746-026-02443-6 · February 18, 2026

Open Access

TL;DR

This paper benchmarks AI agent systems for clinical tasks, finding limited performance gains despite high resource use.

Contribution

The study introduces a systematic evaluation of agentic AI systems using diverse clinical benchmarks.

Findings

01. Agent systems showed only modest accuracy gains over baseline LLMs across clinical benchmarks.

02. Multimodal accuracy was low, and resource demands increased significantly.

03. Despite safeguards, hallucinations remained common in agent outputs.

Abstract

Agentic artificial intelligence (AI) systems, designed to autonomously reason, plan, and invoke tools, have shown promise in healthcare, yet systematic benchmarking of their real-world performance remains limited. In this study, we evaluate two such systems: the open-source OpenManus, built on Meta’s Llama-4 and extended with medically customized agents; and Manus, a proprietary agent system employing a multistep planner-executor-verifier architecture. Both systems were assessed across three benchmark families: AgentClinic, a stepwise dialog-based diagnostic simulation; MedAgentsBench, a knowledge-intensive medical QA dataset; and Humanity’s Last Exam (HLE), a suite of challenging text-only and multimodal questions. Despite access to advanced tools (e.g., web browsing, code development and execution, and text file editing), agent systems yielded only modest accuracy gains over baseline…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Multi-Agent Systems and Negotiation