WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

Olly Styles; Sam Miller; Patricio Cerda-Mardini; Tanaya Guha; Victor Sanchez; Bertie Vidgen

arXiv:2405.00823·cs.CL·August 6, 2024·1 cites

Open Access · 1 repository

TL;DR

WorkBench is a benchmark dataset for evaluating AI agents on realistic workplace tasks, such as sending emails and scheduling meetings. Results show that current agents complete at most 43% of the tasks.

Contribution

The paper introduces WorkBench, a novel, outcome-centric benchmark dataset with a sandbox environment for assessing agents in workplace scenarios, enabling automated evaluation.

Findings

01. Agents perform poorly on the benchmark: the weakest (Llama2-70B) completes only 3% of tasks.

02. GPT-4 achieves the highest success rate, at 43%.

03. Agents often make errors such as sending emails to the wrong recipients.

Abstract

We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong…
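The outcome-centric evaluation described in the abstract can be illustrated with a minimal sketch. This is not the WorkBench implementation: the `send_email` tool, the dictionary-based database, and the helpers `run_actions` and `outcome_matches` are hypothetical names chosen for illustration. The sketch assumes a task is judged by comparing the database state reached by the agent's actions against the state reached by a reference set of actions.

```python
import copy


def send_email(db, recipient, subject):
    """Hypothetical tool: append an email record to the sandbox database."""
    db["emails"].append({"to": recipient, "subject": subject})


def run_actions(initial_db, actions):
    """Apply (tool, kwargs) pairs to a deep copy, leaving the initial state untouched."""
    db = copy.deepcopy(initial_db)
    for tool, kwargs in actions:
        tool(db, **kwargs)
    return db


def outcome_matches(initial_db, agent_actions, reference_actions):
    """Count a task as solved only if both action lists reach the same final state."""
    agent_state = run_actions(initial_db, agent_actions)
    reference_state = run_actions(initial_db, reference_actions)
    return agent_state == reference_state


# Illustrative sandbox with two of the (assumed) databases, both empty.
initial_db = {"emails": [], "calendar": []}

reference = [(send_email, {"recipient": "sam@example.com", "subject": "Q3 report"})]
correct_agent = [(send_email, {"recipient": "sam@example.com", "subject": "Q3 report"})]
# Same tool, wrong recipient: the final database state differs, so the task fails.
wrong_agent = [(send_email, {"recipient": "olly@example.com", "subject": "Q3 report"})]

results = [
    outcome_matches(initial_db, correct_agent, reference),
    outcome_matches(initial_db, wrong_agent, reference),
]
success_rate = sum(results) / len(results)
print(results, success_rate)
```

Because only the final state is compared, this style of check does not depend on which sequence of tool calls the agent used, and an email sent to the wrong recipient is detected as a failure without any manual review.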

Code & Models

Repositories

olly-styles/workbench (Official)

Taxonomy

Topics: Digital Economy and Work Transformation · Business Process Modeling and Analysis · Multi-Agent Systems and Negotiation