SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

Jingxuan Chen; Derek Yuen; Bin Xie; Yuhao Yang; Gongwei Chen; Zhihao Wu; Li Yixing; Xurui Zhou; Weiwen Liu; Shuai Wang; Kaiwen Zhou; Rui Shao; Liqiang Nie; Yasheng Wang; Jianye Hao; Jun Wang; Kun Shao

arXiv:2410.15164·cs.AI·April 2, 2025

TL;DR

SPA-Bench is a comprehensive benchmark for evaluating multimodal large language model-based smartphone agents through diverse tasks, real-time interaction, and multi-dimensional performance assessment, advancing the development of practical mobile AI assistants.

Contribution

It introduces a new benchmark with diverse tasks, a flexible interaction framework, and an automatic evaluation pipeline for assessing smartphone agents.

Findings

1. Agents struggle to interpret mobile user interfaces and to ground their actions in them.
2. Memory retention and execution cost remain limiting factors.
3. The authors propose future research directions to improve real-world applicability.

Abstract

Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-Bench offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over…
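The plug-and-play idea in contribution (2) can be pictured with a small sketch. This is only an illustration under stated assumptions, not SPA-Bench's actual API: the names `Agent`, `ToyDevice`, `ScriptedAgent`, `run_episode`, and `success_rate` are all hypothetical, and a dictionary of screen transitions stands in for a real Android device.

```python
from dataclasses import dataclass
from typing import Protocol


class Agent(Protocol):
    """Hypothetical plug-in interface: anything with `act` can be evaluated."""

    def act(self, instruction: str, observation: str) -> str: ...


@dataclass
class EpisodeResult:
    success: bool
    steps: int


class ToyDevice:
    """Stand-in for a device: the task succeeds once the goal screen is shown."""

    def __init__(self, transitions, start, goal):
        self.transitions = transitions  # maps (screen, action) -> next screen
        self.screen = start
        self.goal = goal

    def observe(self) -> str:
        return self.screen

    def execute(self, action: str) -> None:
        # Unknown actions leave the screen unchanged.
        self.screen = self.transitions.get((self.screen, action), self.screen)

    def done(self) -> bool:
        return self.screen == self.goal


class ScriptedAgent:
    """Trivial agent that looks up its next action from the current screen."""

    def __init__(self, policy):
        self.policy = policy

    def act(self, instruction: str, observation: str) -> str:
        return self.policy.get(observation, "noop")


def run_episode(agent: Agent, device: ToyDevice, instruction: str,
                max_steps: int = 5) -> EpisodeResult:
    """Alternate observe/act/execute until the goal or the step cap is reached."""
    steps = 0
    while steps < max_steps and not device.done():
        device.execute(agent.act(instruction, device.observe()))
        steps += 1
    return EpisodeResult(success=device.done(), steps=steps)


def success_rate(results) -> float:
    """Fraction of episodes that reached the goal."""
    return sum(r.success for r in results) / len(results) if results else 0.0
```

Because the loop depends only on the `act` method, agents with different internals could in principle be swapped in without changing the evaluation code, which is the property the abstract attributes to the framework.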

Peer Reviews

Decision: ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. A diverse collection of 340 tasks.
2. An automated evaluation pipeline that combines multiple evaluation metrics.

Weaknesses

The paper could be better organized.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. This paper studies an important and timely problem. Developing a comprehensive benchmark is important to advance LLM agents in smartphone applications.
2. The present work includes comprehensive experiments with 11 LLM agents on diverse tasks. The findings derived from these experiments could provide insights for future LLM agent design.
3. This paper introduces a plug-and-play framework, which could facilitate real-time interaction with Android devices. It is important to support diverse tas…

Weaknesses

1. The evaluation tasks are constructed by human annotators. That is fine, but I think the present work would benefit a lot if the authors discussed in more detail the process of recruiting human annotators and the protocols for quality control. It would also be great if the authors could discuss the possibility of synthesizing behavior trajectories, which might extend the impact of the present work from evaluation alone to fine-tuning or even pre-training.
2. In addition to off-the-shelf LLM agents and fine-tu…

Reviewer 03 · Rating 6 · Confidence 3

Strengths

S1 - This benchmark offers a notable addition to the community by employing Chinese applications. This expands the scope and applicability of benchmarking in multilingual contexts.
S2 - The benchmark includes diverse tasks, levels, and metrics. It also introduces several new evaluation metrics. Furthermore, a novel coarse-to-fine evaluation approach has been proposed.
S3 - The authors provided comprehensive benchmark references and included many of the existing baselines. Hence, this study o…

Weaknesses

W1 - This benchmark’s unique challenge is unclear. The authors should highlight the specific interesting challenge or novel difficulties it poses compared to existing benchmarks.
W2 - While the newly introduced termination metrics are intriguing and valuable, they seem to require more careful design. Could the authors propose a strategy for applying these metrics to agents that do not produce language-based rationales for termination?
(Minor) Additionally, it would be helpful if the authors cou…

Code & Models

Repositories

ai-agents-2030/SPA-Bench
PyTorch · Official

Taxonomy

Topics: Peer-to-Peer Network Technologies · Mobile Agent-Based Network Management