ART: Action-based Reasoning Task Benchmarking for Medical AI Agents

Ananya Mantravadi; Shivali Dalmia; Abhishek Mukherji

arXiv:2601.08988 · cs.AI · January 15, 2026

TL;DR

ART is a benchmark for evaluating how well medical AI agents perform multi-step, action-based reasoning over real-world electronic health records. It shows that current models handle retrieval well but struggle with aggregation and threshold reasoning.

Contribution

The paper introduces ART, a clinically validated benchmark for action-based reasoning in medical AI. It targets task types that existing benchmarks assess inadequately: threshold evaluation, temporal aggregation, and conditional logic.

Findings

01. GPT-4o-mini and Claude 3.5 Sonnet excel in retrieval after prompt refinement.

02. Significant gaps remain in aggregation (28–64%) and threshold reasoning (32–38%).

03. ART exposes key failure modes in current medical AI reasoning capabilities.

Abstract

Reliable clinical decision support requires medical AI agents capable of safe, multi-step reasoning over structured electronic health records (EHRs). While large language models (LLMs) show promise in healthcare, existing benchmarks inadequately assess performance on action-based tasks involving threshold evaluation, temporal aggregation, and conditional logic. We introduce ART, an Action-based Reasoning clinical Task benchmark for medical AI agents, which mines real-world EHR data to create challenging tasks targeting known reasoning weaknesses. Through analysis of existing benchmarks, we identify three dominant error categories: retrieval failures, aggregation errors, and conditional logic misjudgments. Our four-stage pipeline (scenario identification, task generation, quality audit, and evaluation) produces diverse, clinically validated tasks grounded in real patient data.…
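To make the three task types named in the abstract concrete, here is a minimal Python sketch of retrieval, temporal aggregation, and threshold-based conditional logic over a toy list of lab results. It is an illustration only, not ART's task format or evaluation code: the `Observation` record, the function names, the sample values, and the 5.0 threshold are all hypothetical and carry no clinical meaning.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class Observation:
    """One structured lab result (hypothetical schema, not ART's)."""
    code: str
    value: float
    taken_at: datetime


def latest(observations: Sequence[Observation], code: str) -> Optional[Observation]:
    """Retrieval: return the most recent observation for a lab code, if any."""
    matching = [o for o in observations if o.code == code]
    return max(matching, key=lambda o: o.taken_at, default=None)


def window_mean(observations: Sequence[Observation], code: str,
                end: datetime, window: timedelta) -> Optional[float]:
    """Temporal aggregation: mean value within [end - window, end]."""
    values = [o.value for o in observations
              if o.code == code and end - window <= o.taken_at <= end]
    return mean(values) if values else None


def decide_action(observations: Sequence[Observation], code: str,
                  threshold: float) -> str:
    """Conditional logic: choose an action from the latest value and a threshold."""
    obs = latest(observations, code)
    if obs is None:
        return "no_data"
    return "order_followup" if obs.value > threshold else "no_action"


# Toy data; values and timestamps are invented for illustration.
now = datetime(2026, 1, 15, 12, 0)
records = [
    Observation("potassium", 4.5, now - timedelta(hours=30)),
    Observation("potassium", 5.0, now - timedelta(hours=20)),
    Observation("potassium", 5.5, now - timedelta(hours=2)),
    Observation("glucose", 110.0, now - timedelta(hours=1)),
]

print(latest(records, "potassium").value)                          # 5.5
print(window_mean(records, "potassium", now, timedelta(hours=24)))  # 5.25
print(decide_action(records, "potassium", threshold=5.0))           # order_followup
```

Each function corresponds to one of the error categories the abstract lists: a wrong record returned by `latest` is a retrieval failure, a wrong window or average in `window_mean` is an aggregation error, and a wrong branch in `decide_action` is a conditional logic misjudgment.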

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling