Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction

Danyang Zhang; Zhennan Shen; Rui Xie; Situo Zhang; Tianbao Xie; Zihan Zhao; Siyuan Chen; Lu Chen; Hongshen Xu; Ruisheng Cao; Kai Yu

arXiv:2305.08144·cs.AI·June 14, 2024·1 cites

TL;DR

This paper introduces Mobile-Env, a toolkit for creating reliable GUI benchmarks in Android, enabling better evaluation of LLM and VLM agents, and revealing current models' limitations in real-world tasks.

Contribution

We present Mobile-Env, a new toolkit for building qualified GUI benchmarks in Android, facilitating trustworthy and reproducible evaluations of LLM and VLM agents.

Findings

01. Advanced models like GPT-4V and LLaMA-3 struggle with simple GUI tasks.

02. Mobile-Env enables comprehensive evaluation across real-world apps.

03. Current models reveal significant gaps in GUI task performance.

Abstract

The Graphical User Interface (GUI) is pivotal for human interaction with the digital world, enabling efficient device control and the completion of complex tasks. Recent progress in Large Language Models (LLMs) and Vision Language Models (VLMs) offers the chance to create advanced GUI agents. To ensure their effectiveness, there's a pressing need for qualified benchmarks that provide trustworthy and reproducible evaluations -- a challenge current benchmarks often fail to address. To tackle this issue, we introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment. Mobile-Env offers an isolated and controllable setting for reliable evaluations, and accommodates intermediate instructions and rewards to reflect real-world usage more naturally. Utilizing Mobile-Env, we collect an open-world task set across various real-world apps and…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · AI in Service Interactions

Methods: Attention Is All You Need · Cosine Annealing · Residual Connection · Linear Layer · Discriminative Fine-Tuning · Byte Pair Encoding · Linear Warmup With Cosine Annealing · Weight Decay · Dropout