AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva

TL;DR
AndroidWorld is a dynamic, reproducible Android benchmarking environment with 116 real-world tasks, enabling realistic testing of autonomous agents and highlighting the challenges of cross-platform generalization and task variability.
Contribution
We introduce AndroidWorld, a novel dynamic Android benchmark with parameterized tasks, enhancing realism and reproducibility for evaluating autonomous agents.
Findings
Best agent completes 30.6% of tasks
Web agents are less effective on mobile platforms
Task variations significantly impact agent performance
Abstract
Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. To ensure reproducibility, each task includes dedicated initialization, success-checking, and tear-down logic, which modifies and inspects the device's system state. We experiment with baseline agents to test AndroidWorld and provide initial results on the…
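The task lifecycle described in the abstract (parameterized natural-language goals, plus initialization, success-checking, and tear-down against device state) can be illustrated with a minimal sketch. Everything below is hypothetical and is not the AndroidWorld API: `FakeDevice` is an in-memory stand-in for an emulator's system state, and `AddContactTask` and `run_episode` are names invented for this illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FakeDevice:
    """Hypothetical in-memory stand-in for an emulator's system state."""
    contacts: dict = field(default_factory=dict)


class AddContactTask:
    """Hypothetical parameterized task; an assumption, not AndroidWorld's API."""

    template = "Add a contact named {name} with phone number {number}."

    def __init__(self, seed: int):
        # A fixed seed makes the sampled parameters reproducible.
        rng = random.Random(seed)
        self.params = {
            "name": rng.choice(["Ana", "Bo", "Chen"]),
            "number": "555" + "".join(str(rng.randint(0, 9)) for _ in range(4)),
        }

    @property
    def goal(self) -> str:
        # Natural-language instruction built from the sampled parameters.
        return self.template.format(**self.params)

    def initialize(self, device: FakeDevice) -> None:
        # Put the device in a known state: the contact must not exist yet.
        device.contacts.pop(self.params["name"], None)

    def is_successful(self, device: FakeDevice) -> float:
        # Reward comes from inspecting device state, not the agent's output.
        stored = device.contacts.get(self.params["name"])
        return 1.0 if stored == self.params["number"] else 0.0

    def tear_down(self, device: FakeDevice) -> None:
        # Remove anything the episode created so later tasks start clean.
        device.contacts.pop(self.params["name"], None)


def run_episode(task, device, agent) -> float:
    """Run one episode: initialize, let the agent act, check, tear down."""
    task.initialize(device)
    try:
        agent(task.goal, device)
        return task.is_successful(device)
    finally:
        task.tear_down(device)
```

In this sketch, different seeds yield different instances of the same task template, while a repeated seed reproduces the same goal, which is one simple way to get both variability and reproducibility.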
Peer Reviews
Decision: ICLR 2025 Poster
- AndroidWorld’s dynamic task construction introduces extensive variability in task conditions, offering a realistic and reproducible environment for testing autonomous agents on Android.
- The paper provides a robust baseline evaluation and a thorough performance analysis across real-world conditions.
- Extensive experiments provide essential insights into current agents' limitations and suggest potential pathways for improvement in future cross-platform agent designs.
- The agents achieve a low overall success rate (30.6%), which, while reflecting the environment’s complexity, suggests that current methods may need significant refinement to handle mobile platforms effectively.
- Though this paper introduces extensive task parameterization, it does not provide a detailed explanation of how specific parameter variations impact agent performance. There is also no analysis of performance changes across different task components or parameters.
- Large foundation m
S1 - This work presents a foundational contribution that advances the AI community's understanding of **mobile device control**, offering a robust framework for evaluating LLM agents in interactive environments.
S2 - The related work and comparisons to existing benchmarks are well-organized and comprehensive. This provides a clear context of how this benchmark builds upon and differentiates itself from prior studies.
S3 - The benchmark introduces an important challenge related to task generali
W1 - The explanation of the action space could be more detailed. For instance, in the ACTION_TYPE section described in Appendix B.2, further clarification on the purpose of the "STATUS" action would be helpful. Additionally, the rationale for a separate "SWIPE" action, given that "SCROLL" already exists, could be further justified.
W2 - Although line 257 mentions a limited set of high-level APIs, Appendix B lacks specific details on these APIs. More explanation would help us
1. Solid contribution: this benchmark fills the gap for solid, reproducible, and executable benchmarks for Android device control tasks.
2. Good presentation: the presentation is clear, and the discussion of related work is comprehensive.
3. Interesting experiments: the experiments present a few good baselines and show how humans perform on the tasks. The robustness analysis also provides new information to the community.
1. Lack of many real-world apps and tasks: The benchmark lacks many real-world applications and tasks, as ensuring full reproducibility and automated evaluation makes it impossible to include closed-source apps like YouTube, Twitter (X), Amazon, or actual real-web browsing. This results in inherent sim-to-real gaps.
2. Lack of diversity in Android devices and OS versions: The emulated device is fixed as a Pixel 6 running Android 13. However, in practice, what we care about is agents' performance across
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · Scheduling and Optimization Algorithms
