AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Harsh Trivedi; Tushar Khot; Mareike Hartmann; Ruskin Manku; Vinty Dong; Edward Li; Shashank Gupta; Ashish Sabharwal; Niranjan Balasubramanian

arXiv:2407.18901·cs.SE·July 29, 2024·1 cites

Open Access · 1 Repo

TL;DR

AppWorld introduces a comprehensive environment and benchmark for evaluating autonomous agents' ability to perform complex, multi-app digital tasks through rich code generation, addressing limitations of existing benchmarks.

Contribution

We developed AppWorld Engine and AppWorld Benchmark, enabling realistic, diverse, and challenging tasks for assessing interactive coding agents' capabilities.

Findings

01 GPT-4o solves ~49% of normal tasks

02 GPT-4o solves ~30% of challenge tasks

03 The benchmark is highly challenging for current models

Abstract

Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, shopping app) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust…
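To make the contrast with "a simple sequence of API calls" concrete, here is a minimal sketch of the kind of multi-app code the abstract describes for its grocery-ordering example. Everything in it is an assumption made for illustration: the `NotesApp`, `ShoppingApp`, and `MessagingApp` classes, their methods, and the sample data are hypothetical in-memory stand-ins, not AppWorld's actual apps or APIs. The point is only that the task needs a loop, a branch on what the environment returns, and a follow-up action in a third app.

```python
from dataclasses import dataclass, field


@dataclass
class NotesApp:
    """Hypothetical notes app (illustration only, not AppWorld's API)."""
    notes: dict[str, list[str]]

    def get_note(self, title: str) -> list[str]:
        return self.notes[title]


@dataclass
class ShoppingApp:
    """Hypothetical shopping app with a tiny name -> price catalog."""
    catalog: dict[str, float]
    cart: dict[str, int] = field(default_factory=dict)

    def search(self, query: str) -> list[str]:
        # Case-insensitive substring match over product names.
        return [name for name in self.catalog if query.lower() in name.lower()]

    def add_to_cart(self, product: str, quantity: int = 1) -> None:
        self.cart[product] = self.cart.get(product, 0) + quantity

    def cart_total(self) -> float:
        return sum(self.catalog[p] * q for p, q in self.cart.items())


@dataclass
class MessagingApp:
    """Hypothetical messaging app that records sent messages."""
    sent: list[tuple[str, str]] = field(default_factory=list)

    def send(self, recipient: str, text: str) -> None:
        self.sent.append((recipient, text))


def order_groceries(notes: NotesApp, shop: ShoppingApp,
                    chat: MessagingApp, roommate: str) -> list[str]:
    """Add each item on the grocery note to the cart; report missing items."""
    missing = []
    for item in notes.get_note("Grocery list"):
        matches = shop.search(item)
        if not matches:
            # Branch on the environment's response: nothing to buy.
            missing.append(item)
            continue
        # Pick the cheapest matching product.
        cheapest = min(matches, key=lambda name: shop.catalog[name])
        shop.add_to_cart(cheapest)
    if missing:
        chat.send(roommate, "Could not find: " + ", ".join(missing))
    return missing


# Sample data (made up for this sketch).
notes = NotesApp({"Grocery list": ["milk", "eggs", "saffron"]})
shop = ShoppingApp({"Whole Milk": 3.5, "Oat Milk": 4.25, "Eggs (12)": 2.75})
chat = MessagingApp()
missing = order_groceries(notes, shop, chat, "roommate@example.com")
```

In an interactive setting the agent would not write this in one shot; it would typically issue a call, inspect the result, and only then decide what to write next, which is the iterative behavior the abstract refers to.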

Code & Models

Repositories

stonybrooknlp/appworld (Official)

Taxonomy

Topics: Multimedia Communication and Technology · Smart Parking Systems Research