Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, Kai Yu

TL;DR
This paper introduces Mobile-Env, a toolkit for creating reliable GUI benchmarks in Android, enabling better evaluation of LLM and VLM agents, and revealing current models' limitations in real-world tasks.
Contribution
We present Mobile-Env, a new toolkit for building qualified GUI benchmarks in Android, facilitating trustworthy and reproducible evaluations of LLM and VLM agents.
Findings
Advanced models like GPT-4V and LLaMA-3 struggle with simple GUI tasks.
Mobile-Env enables comprehensive evaluation across real-world apps.
Current models show significant performance gaps on GUI tasks.
Abstract
The Graphical User Interface (GUI) is pivotal for human interaction with the digital world, enabling efficient device control and the completion of complex tasks. Recent progress in Large Language Models (LLMs) and Vision Language Models (VLMs) offers the chance to create advanced GUI agents. To ensure their effectiveness, there's a pressing need for qualified benchmarks that provide trustworthy and reproducible evaluations -- a challenge current benchmarks often fail to address. To tackle this issue, we introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment. Mobile-Env offers an isolated and controllable setting for reliable evaluations, and accommodates intermediate instructions and rewards to reflect real-world usage more naturally. Utilizing Mobile-Env, we collect an open-world task set across various real-world apps and…
