WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting
Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, Bertie Vidgen

TL;DR
WorkBench is a benchmark dataset for evaluating AI agents on realistic workplace tasks such as sending emails and scheduling meetings. Even the best agent tested, GPT-4, completes only 43% of its 690 tasks.
Contribution
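The paper introduces WorkBench, a benchmark dataset with a sandbox environment for assessing agents in workplace scenarios. Each task has a unique, unambiguous correct outcome in the sandbox databases, so evaluation can be automated; the authors call this outcome-centric evaluation.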
Findings
The five ReAct agents evaluated perform poorly on the benchmark, with success rates as low as 3% (Llama2-70B).
GPT-4 achieves the highest success rate at 43%.
Agents' errors can lead to the wrong action being taken, such as sending an email to the wrong recipient.
Abstract
We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong…
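The outcome-centric evaluation described in the abstract can be illustrated with a minimal sketch. It assumes the sandbox is a plain dictionary of databases and that each tool call is a function that mutates it. The names `run_actions`, `outcome_matches`, `send_email`, and `search_email` are hypothetical and are not taken from the WorkBench codebase. The idea is to apply the agent's actions and the reference actions to separate copies of the same initial state, then compare the resulting databases rather than the action sequences.

```python
import copy


def run_actions(databases, actions):
    """Apply each action to a deep copy of the sandbox and return the final state."""
    state = copy.deepcopy(databases)  # leave the caller's databases untouched
    for action in actions:
        action(state)
    return state


def outcome_matches(initial, predicted_actions, reference_actions):
    """True when both action lists leave the databases in the same final state."""
    predicted_state = run_actions(initial, predicted_actions)
    reference_state = run_actions(initial, reference_actions)
    return predicted_state == reference_state


def send_email(recipient, subject):
    """Hypothetical tool with a side effect: appends a row to the email database."""
    def apply(state):
        state["email"].append({"to": recipient, "subject": subject})
    return apply


def search_email(state):
    """Hypothetical read-only tool: it does not change any database."""
    return None


# Assumed toy sandbox with two of the databases; the real benchmark has five.
initial = {"email": [], "calendar": []}
reference = [send_email("sam@example.com", "Q3 report")]

# One agent takes an extra read-only step; the other emails the wrong person.
correct_agent = [search_email, send_email("sam@example.com", "Q3 report")]
wrong_agent = [send_email("olly@example.com", "Q3 report")]

results = [outcome_matches(initial, agent, reference)
           for agent in (correct_agent, wrong_agent)]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

Comparing final states means an agent is not penalised for taking a different route, such as an extra search, as long as the databases end up as expected. An email sent to the wrong recipient changes the state and counts as a failure.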
Taxonomy
Topics: Digital Economy and Work Transformation · Business Process Modeling and Analysis · Multi-Agent Systems and Negotiation
