A3: Android Agent Arena for Mobile GUI Agents with Essential-State Procedural Evaluation
Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, Hongsheng Li

TL;DR
This paper introduces Android Agent Arena (A3), a benchmark and evaluation system for mobile GUI agents that measures performance in dynamic, real-world online apps using essential-state procedural evaluation.
Contribution
The paper presents a novel essential-state based procedural evaluation system and benchmark for mobile GUI agents, addressing limitations of static assessments and enabling more realistic performance measurement.
Findings
A3 includes 100 tasks from 20 popular online apps across diverse categories.
The evaluation method uses MLLMs as reward models for task verification.
The system facilitates data collection and environment management for mobile GUI research.
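To make the idea of essential-state verification concrete, here is a minimal sketch. It is not the A3 implementation. The names EssentialState, Judge, evaluate_trajectory, and keyword_judge are hypothetical, observations are plain strings rather than screenshots, and the requirement that states be reached in order is an assumption. In A3 the judging role is played by an MLLM; a keyword matcher stands in here so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class EssentialState:
    """Hypothetical container: one checkpoint a task must pass through."""
    description: str


# Hypothetical judge signature: (state description, observation) -> bool.
Judge = Callable[[str, str], bool]


def evaluate_trajectory(
    states: Sequence[EssentialState],
    observations: Sequence[str],
    judge: Judge,
) -> Tuple[bool, float]:
    """Match essential states against observations, in order.

    Returns (success, progress), where progress is the fraction of
    essential states reached. In-order matching is an assumption.
    """
    matched = 0
    for obs in observations:
        if matched == len(states):
            break  # every essential state has already been verified
        if judge(states[matched].description, obs):
            matched += 1
    progress = matched / len(states) if states else 1.0
    return matched == len(states), progress


def keyword_judge(description: str, observation: str) -> bool:
    """Offline stand-in for an MLLM verifier: plain substring match."""
    return description.lower() in observation.lower()


states = [EssentialState("search page"), EssentialState("results list")]
full_run = ["home screen", "search page open", "results list shown"]
partial_run = ["home screen", "search page open"]

print(evaluate_trajectory(states, full_run, keyword_judge))     # (True, 1.0)
print(evaluate_trajectory(states, partial_run, keyword_judge))  # (False, 0.5)
```

Passing the judge in as a callable keeps the matching logic independent of any particular model, so an MLLM-backed verifier could be substituted without changing evaluate_trajectory.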
Abstract
The advancement of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has catalyzed the development of mobile graphical user interface (GUI) AI agents, which are designed to autonomously perform tasks on mobile devices. However, a significant gap persists in mobile GUI agent evaluation: existing benchmarks predominantly rely on either static frame assessments such as AndroidControl or offline static apps such as AndroidWorld, and thus fail to capture agent performance in dynamic, real-world online mobile apps. To address this gap, we present Android Agent Arena (A3), a novel "essential-state" based procedural evaluation system for mobile GUI agents. A3 introduces a benchmark of 100 tasks derived from 20 widely used, dynamic online apps across 20 categories from the Google Play Store, ensuring comprehensive evaluation coverage. A3 also presents a novel "essential-state"…
Taxonomy
Topics: Mobile Agent-Based Network Management · Peer-to-Peer Network Technologies
