
TL;DR
AutomationBench is a new benchmark that evaluates AI agents on cross-application business workflows, which require API endpoint discovery, policy adherence, and coordination across multiple systems.
Contribution
The paper introduces a multi-domain benchmark based on real Zapier workflow patterns, emphasizing endpoint discovery, policy following, and end-state correctness for AI agents.
Findings
Current frontier models score below 10% on AutomationBench.
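The benchmark captures the complexity of real business workflows: tasks span Sales, Marketing, Operations, Support, Finance, and HR, and environments contain irrelevant and sometimes misleading records.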
AutomationBench provides a challenging environment for evaluating AI agent capabilities.
Abstract
Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform, requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier's platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: whether the correct data ended up in the right…
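To make "programmatic, end-state only" grading concrete, here is a minimal Python sketch. It is not AutomationBench's actual grader: the function name `grade_end_state`, the nested-dictionary state layout (system, then record ID, then fields), and the sample records are all assumptions made for illustration. The check looks only at what each system contains after the agent finishes, not at which API calls were made along the way.

```python
from typing import Any

# Hypothetical layout: system name -> record id -> field name -> value.
State = dict[str, dict[str, dict[str, Any]]]


def grade_end_state(final_state: State, expected: State) -> bool:
    """Return True only if every expected field value appears in the final state.

    Extra records or fields in final_state are ignored; the agent's call
    history is never inspected.
    """
    for system, records in expected.items():
        for record_id, fields in records.items():
            actual = final_state.get(system, {}).get(record_id)
            if actual is None:
                return False  # expected record is missing from this system
            for key, value in fields.items():
                if actual.get(key) != value:
                    return False  # field is missing or holds the wrong value
    return True


# Illustrative data only: a CRM and a calendar after a made-up task.
expected: State = {
    "crm": {"lead-1": {"status": "qualified"}},
    "calendar": {"evt-1": {"attendee": "lead-1"}},
}
final_state: State = {
    "crm": {
        "lead-1": {"status": "qualified", "owner": "sam"},
        "lead-2": {"status": "new"},  # unrelated distractor record
    },
    "calendar": {"evt-1": {"attendee": "lead-1"}},
}

passed = grade_end_state(final_state, expected)
```

In this sketch `passed` is `True` because both expected records carry the expected field values. The distractor record `lead-2` does not affect the result, and a run that left the calendar empty would fail no matter how many correct API calls preceded it.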
