CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Haolin Chen; Deon Metelski; Leon Qi; Tao Xia; Joonyul Lee; Steve Brown; Kevin Riley; Frank Wang; T. Y. Alvin Liu; Hank Capps MD; Zeyu Tang; Xiangchen Song; Lingjing Kong; Fan Feng; Tianyi Zeng; Zhiwei Liu; Zixian Ma; Hang Jiang; Fangli Geng; Yuan Yuan; Chenyu You; Qingsong Wen; Hua Wei; Yanjie Fu; Yue Zhao; Carl Yang; Biwei Huang; Kun Zhang; Caiming Xiong; Sanmi Koyejo; Eric P. Xing; Philip S. Yu; Weiran Yao

arXiv:2605.16679·cs.CL·May 20, 2026

TL;DR

This paper introduces CHI-Bench, a comprehensive healthcare workflow benchmark testing AI agents on complex, multi-role, long-horizon tasks involving policy-rich decision-making and multi-turn interactions.

Contribution

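The paper presents a benchmark of long-horizon healthcare workflows across three domains (provider prior authorization, payer utilization management, and care management). Agents operate a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, and the results highlight the limitations of current agents on realistic workflows.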

Findings

01 Best agent resolves only 28% of tasks

02 No agent clears 20% on strict pass criteria

03 Performance drops to 3.8% when executing all tasks in a single session

Abstract

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density: decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition: a single task requires the agent to play multiple roles with handoffs; and multilateral interaction: intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce $χ$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status by making tool calls and writing the role's artifacts, guided by a 1,290+ document managed-care…

Code & Models

Repositories

actava-ai/chi-bench (GitHub)
