Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents

Shihan Deng; Weikai Xu; Hongda Sun; Wei Liu; Tao Tan; Jianfeng Liu; Ang Li; Jian Luan; Bin Wang; Rui Yan; Shuo Shang

arXiv:2407.00993·cs.AI·July 2, 2024·1 cites

TL;DR

Mobile-Bench is a comprehensive evaluation benchmark for LLM-based mobile agents, addressing current limitations by expanding UI operations, collecting diverse data, and introducing a new metric to assess planning and reasoning capabilities.

Contribution

The paper introduces Mobile-Bench, a novel benchmark with expanded APIs, diverse data categories, and a new evaluation metric for assessing LLM mobile agents.

Findings

01

Expanded API set with 103 functions improves task efficiency.

02

Data is categorized into three groups (SAST, SAMT, and MAMT) that reflect different levels of task complexity.

03

CheckPoint metric accurately evaluates planning and reasoning steps.
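
The idea of scoring intermediate steps rather than only the final outcome can be illustrated with a minimal sketch. This is not the paper's CheckPoint formula, which this page does not state. It assumes a hypothetical `Task` record whose `checkpoints` field lists keywords expected somewhere in the agent's action trace, and a hypothetical `checkpoint_coverage` helper that returns the fraction of those keywords that were hit.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record; the field names are illustrative assumptions."""
    instruction: str
    category: str  # one of "SAST", "SAMT", "MAMT"
    checkpoints: list = field(default_factory=list)  # keywords expected in the trace


def checkpoint_coverage(task: Task, trace: list) -> float:
    """Return the fraction of checkpoints found in at least one trace step.

    Assumption: a checkpoint counts as hit when its keyword appears
    (case-insensitively) as a substring of any step in the trace.
    """
    if not task.checkpoints:
        return 0.0
    hits = sum(
        any(cp.lower() in step.lower() for step in trace)
        for cp in task.checkpoints
    )
    return hits / len(task.checkpoints)


# Usage: the agent opened the right app and searched, but never shared.
task = Task(
    instruction="Find a coffee shop and share its location",
    category="SAMT",
    checkpoints=["maps", "search", "share"],
)
trace = ["open Maps app", "search('coffee shop')"]
score = checkpoint_coverage(task, trace)
```

A process-level score like this gives partial credit to a trace that reaches some expected steps but fails the task overall, which a pass/fail success rate cannot distinguish.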

Abstract

With the remarkable advancements of large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, there is a scarcity of benchmarks available for LLM-based mobile agents. Benchmarking these agents generally faces three main challenges: (1) The inefficiency of UI-only operations imposes limitations to task evaluation. (2) Specific instructions within a singular application lack adequacy for assessing the multi-dimensional reasoning and decision-making capacities of LLM mobile agents. (3) Current evaluation metrics are insufficient to accurately assess the process of sequential actions. To this end, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by incorporating 103 collected APIs to accelerate the efficiency of task completion.…

Code & Models

Repositories

XiaoMi/MobileBench
Official

Datasets

xwk123/MobileBench-v1
dataset · 55 downloads

Videos

Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents

Taxonomy

Topics: Mobile Agent-Based Network Management · Peer-to-Peer Network Technologies · Multi-Agent Systems and Negotiation