HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

Suhana Bedi; Ryan Welch; Ethan Steinberg; Michael Wornow; Taeil Matthew Kim; Haroun Ahmed; Peter Sterling; Bravim Purohit; Qurat Akram; Angelic Acosta; Esther Nubla; Pritika Sharma; Michael A. Pfeffer; Sanmi Koyejo; and Nigam H. Shah

arXiv:2604.09937·cs.AI·April 14, 2026

TL;DR

HealthAdminBench is a benchmark for evaluating computer-use agents on realistic healthcare administration tasks. The best agent evaluated reaches only 36.3% task success, well short of real-world requirements.

Contribution

The paper introduces HealthAdminBench: four GUI environments (an EHR, two payer portals, and a fax system) and 135 expert-defined tasks, each decomposed into verifiable subtasks for a total of 1,698 evaluation points.

Findings

01. Best agent achieves only 36.3% task success

02. Subtask success rates are higher than end-to-end success

03. Current agents show a substantial gap from real-world requirements
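
The gap between subtask success and end-to-end success can be illustrated with a small sketch. It is not the paper's scoring code: the function names, the toy task names, and the rule that a task counts as solved only when every one of its subtasks passes are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical data layout: task name -> pass/fail result for each subtask check.
TaskResults = Dict[str, List[bool]]


def subtask_success_rate(results: TaskResults) -> float:
    """Fraction of individual subtask checks that passed, pooled over all tasks."""
    total = sum(len(checks) for checks in results.values())
    passed = sum(sum(checks) for checks in results.values())
    return passed / total if total else 0.0


def end_to_end_success_rate(results: TaskResults) -> float:
    """Fraction of tasks whose subtasks all passed (assumed success rule)."""
    if not results:
        return 0.0
    solved = sum(1 for checks in results.values() if checks and all(checks))
    return solved / len(results)


# Toy data, invented for illustration only (not results from the paper).
toy_results: TaskResults = {
    "prior_auth_example": [True, True, True, True],
    "appeal_example": [True, True, False, True],
    "dme_order_example": [True, False, True, True],
}

print(f"subtask success:    {subtask_success_rate(toy_results):.1%}")
print(f"end-to-end success: {end_to_end_success_rate(toy_results):.1%}")
```

In this toy case 10 of 12 subtask checks pass, yet only 1 of 3 tasks is completed, because a single failed check is enough to sink a task under the assumed rule.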

Abstract

Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments: an EHR, two payer portals, and a fax system, and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains…
