SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao

TL;DR
SPA-Bench is a comprehensive benchmark for evaluating multimodal large language model-based smartphone agents through diverse tasks, real-time interaction, and multi-dimensional performance assessment, advancing the development of practical mobile AI assistants.
Contribution
It introduces a new benchmark with diverse tasks, a flexible interaction framework, and an automatic evaluation pipeline for assessing smartphone agents.
Findings
Agents struggle to interpret mobile interfaces and to ground actions in them.
Memory retention and execution cost are identified as further limitations.
Future directions are proposed to improve real-world applicability.
Abstract
Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-Bench offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over…
Peer Reviews
Decision: ICLR 2025 Spotlight
1. A diverse collection of 340 tasks. 2. An automated evaluation pipeline consisting of multiple evaluation metrics.
The paper could be better organized.
1. This paper studies an important and timely problem. Developing a comprehensive benchmark is important to advance LLM agents in smartphone applications. 2. The present work includes comprehensive experiments with 11 LLM agents on diverse tasks. The findings derived from these experiments could provide insights for future LLM agent design. 3. This paper introduces a plug-and-play framework, which could facilitate real-time interaction with Android devices. It is important to support diverse tas
1. The evaluation tasks are constructed by human annotators. It is fine, but I think the present work will benefit a lot if the authors could discuss more about the process of recruiting human annotators and protocols for quality control. It will also be great if the authors can discuss the possibility of synthesizing behavior trajectory, which might extend the impact of present work from solely evaluation to fine-tuning or even pre-training. 2. In addition to off-the-shelf LLM agents and fine-tu
S1 - This benchmark offers a notable addition to the community by employing Chinese applications. This expands the scope and applicability of benchmarking in multilingual contexts. S2 - The benchmark includes diverse tasks, levels, and metrics. It also introduces several new evaluation metrics. Furthermore, a novel coarse-to-fine evaluation approach has been proposed. S3 - The authors provided comprehensive benchmark references and included many of the existing baselines. Hence, this study o
W1 - This benchmark’s unique challenge is unclear. The authors should highlight the specific interesting challenge or novel difficulties it poses compared to existing benchmarks. W2 - While the newly introduced termination metrics are intriguing and valuable, they seem to require more careful design. Could the authors propose a strategy for applying these metrics to agents that do not produce language-based rationales for termination? (Minor) Additionally, it would be helpful if the authors cou
Taxonomy
Topics: Peer-to-Peer Network Technologies · Mobile Agent-Based Network Management
