AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles; Sarah Clinckemaillie; Yifan Chang; Jonathan Waltz; Gabrielle Lau; Marybeth Fair; Alice Li; William Bishop; Wei Li; Folawiyo Campbell-Ajala; Daniel Toyama; Robert Berry; Divya Tyamagundlu; Timothy Lillicrap; Oriana Riva

arXiv:2405.14573·cs.AI·April 8, 2025·1 cites

TL;DR

AndroidWorld is a dynamic, reproducible Android benchmarking environment with 116 real-world tasks, enabling realistic testing of autonomous agents and highlighting the challenges of cross-platform generalization and task variability.

Contribution

We introduce AndroidWorld, a novel dynamic Android benchmark with parameterized tasks, enhancing realism and reproducibility for evaluating autonomous agents.

Findings

01. Best agent completes 30.6% of tasks
02. Web agents are less effective on mobile platforms
03. Task variations significantly impact agent performance

Abstract

Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. To ensure reproducibility, each task includes dedicated initialization, success-checking, and tear-down logic, which modifies and inspects the device's system state. We experiment with baseline agents to test AndroidWorld and provide initial results on the…
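The task lifecycle the abstract describes (a parameterized natural-language goal, plus initialization, success-checking, and tear-down logic that modify and inspect device state) can be illustrated with a minimal sketch. Every name here (`FakeDevice`, `AddContactTask`, `run_episode`) is hypothetical and chosen for illustration only; none of it is taken from the android_world repository, and an in-memory dictionary stands in for the real device state.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FakeDevice:
    """Hypothetical stand-in for device state (a real setup would query an emulator)."""
    contacts: dict = field(default_factory=dict)


class AddContactTask:
    """Hypothetical parameterized task; not the android_world API."""

    template = "Add a contact named {name} with phone number {phone}."

    def __init__(self, seed: int):
        # A fixed seed makes the sampled parameters reproducible.
        rng = random.Random(seed)
        self.params = {
            "name": rng.choice(["Ana Diaz", "Li Wei", "Sam Okoro"]),
            "phone": "555-" + "".join(rng.choice("0123456789") for _ in range(4)),
        }

    @property
    def goal(self) -> str:
        # The natural-language instruction shown to the agent.
        return self.template.format(**self.params)

    def initialize(self, device: FakeDevice) -> None:
        # Put the device in a known state: the target contact must not exist yet.
        device.contacts.pop(self.params["name"], None)

    def is_successful(self, device: FakeDevice) -> bool:
        # Inspect state directly instead of matching the agent's action trace.
        return device.contacts.get(self.params["name"]) == self.params["phone"]

    def tear_down(self, device: FakeDevice) -> None:
        # Undo side effects so the next task starts clean.
        device.contacts.pop(self.params["name"], None)


def run_episode(task, device, agent) -> float:
    """Run one episode and return a binary reward."""
    task.initialize(device)
    try:
        agent(task.goal, device)
        return 1.0 if task.is_successful(device) else 0.0
    finally:
        task.tear_down(device)
```

In this sketch, varying the seed yields a different instance of the same task template, which is one simple way to picture how a single task definition can expand into many concrete test cases.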

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- AndroidWorld’s dynamic task construction introduces extensive variability in task conditions, offering a realistic and reproducible environment for testing autonomous agents on Android.
- The paper provides a robust baseline evaluation and a thorough performance analysis across real-world conditions.
- Extensive experiments provide essential insights into current agents' limitations and suggest potential pathways for improvement in future cross-platform agent designs.

Weaknesses

- The agents achieve a low overall success rate (30.6%), which, while reflecting the environment’s complexity, suggests that current methods may need significant refinement to handle mobile platforms effectively.
- Though this paper introduces extensive task parameterization, it does not provide a detailed explanation of how specific parameter variations impact agent performance. There is also no analysis of performance changes across different task components or parameters.
- Large foundation m

Reviewer 02 · Rating 8 · Confidence 4

Strengths

S1 - This work presents a foundational contribution that advances the AI community's understanding of **mobile device control**, offering a robust framework for evaluating LLM agents in interactive environments.
S2 - The related work and comparisons to existing benchmarks are well-organized and comprehensive. This provides a clear context of how this benchmark builds upon and differentiates itself from prior studies.
S3 - The benchmark introduces an important challenge related to task generali

Weaknesses

W1 - The explanation of the action space could be more detailed. For instance, in the ACTION_TYPE section described in Appendix B.2, further clarification on the purpose of the "STATUS" action would be helpful. Additionally, the rationale for a separate "SWIPE" action, given that "SCROLL" already exists, could be further justified.
W2 - Although line 257 mentions a limited set of high-level APIs, Appendix B lacks specific details on these APIs. More explanation would help us

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Solid contribution: this benchmark fills the gap for solid, reproducible, and executable benchmarks for Android device control tasks.
2. Good presentation: the presentation is clear, and the discussion of related work is comprehensive.
3. Interesting experiments: the experiments present a few good baselines and show how humans perform on the tasks. The robustness analysis also provides new information to the community.

Weaknesses

1. Lack of many real-world apps and tasks: The benchmark lacks many real-world applications and tasks, as ensuring full reproducibility and automated evaluation makes it impossible to include closed-source apps like YouTube, Twitter (X), or Amazon, or real web browsing. This results in inherent sim-to-real gaps.
2. Lack of diversity in Android devices and OS versions: The emulated device is fixed as a Pixel 6 running Android 13. However, in practice, what we care about is agents' performance across

Code & Models

Repositories

google-research/android_world
Official

Models

Datasets

OS-Copilot/OS-Atlas-data
dataset · 2.2k dl

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · Scheduling and Optimization Algorithms