LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Task Automation
Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu

TL;DR
LlamaTouch introduces a scalable, faithful on-device testbed for mobile UI task automation evaluation, leveraging UI state transfer and multi-level matching to improve over traditional human validation methods.
Contribution
It presents a novel on-device evaluation framework with UI state transfer, detailed annotation, and multi-level matching, enabling scalable and faithful assessment of mobile agents.
Findings
High evaluation faithfulness demonstrated in real-world environments.
Better scalability compared to human validation methods.
Supports diverse mobile applications with multiple agents and tasks.
Abstract
The emergent large language/multimodal models facilitate the evolution of mobile agents, especially in mobile UI task automation. However, existing evaluation approaches, which rely on human validation or established datasets to compare agent-predicted actions with predefined action sequences, are unscalable and unfaithful. To overcome these limitations, this paper presents LlamaTouch, a testbed for on-device mobile UI task execution and faithful, scalable task evaluation. By observing that the task execution process only transfers UI states, LlamaTouch employs a novel evaluation approach that only assesses whether an agent traverses all manually annotated, essential application/system states. LlamaTouch comprises three key techniques: (1) On-device task execution that enables mobile agents to interact with realistic mobile environments for task execution. (2) Fine-grained UI component…
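The state-traversal idea in the abstract can be illustrated with a minimal sketch. This is not LlamaTouch's implementation: the dictionary-based state representation, the function names `traverses_essential_states` and `subset_match`, and the assumption that essential states must be reached in annotation order are all hypothetical choices made for illustration. The matcher is passed in as a parameter so that stricter or looser comparisons could be swapped in.

```python
from typing import Callable, Dict, Sequence

# Hypothetical state representation: a flat dict of UI properties.
State = Dict[str, str]


def subset_match(observed: State, annotated: State) -> bool:
    """Illustrative matcher: every annotated key/value must appear in the observed state."""
    return all(observed.get(key) == value for key, value in annotated.items())


def traverses_essential_states(
    trace: Sequence[State],
    essential: Sequence[State],
    matches: Callable[[State, State], bool] = subset_match,
) -> bool:
    """Return True if the agent's trace hits every essential state.

    Assumption (not confirmed by the source): essential states must be
    reached in the order they were annotated, with any number of other
    states allowed in between.
    """
    next_needed = 0
    for observed in trace:
        if next_needed < len(essential) and matches(observed, essential[next_needed]):
            next_needed += 1
    return next_needed == len(essential)


# Example: the agent takes a detour through a settings screen,
# but still reaches both annotated states in order.
agent_trace = [
    {"activity": "Launcher"},
    {"activity": "Settings"},
    {"activity": "ClockMain", "tab": "alarm"},
    {"activity": "ClockMain", "tab": "alarm", "alarm_8am": "on"},
]
essential_states = [
    {"activity": "ClockMain"},
    {"alarm_8am": "on"},
]
result = traverses_essential_states(agent_trace, essential_states)
```

Under this sketch, a trace is judged by the states it reaches rather than by the exact actions taken, so two agents that follow different action sequences can both pass as long as they visit the annotated states.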
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Interactive and Immersive Displays · IoT and Edge/Fog Computing
Methods: Sparse Evolutionary Training
