TL;DR
This paper evaluates the robustness of mobile GUI agents powered by large language models against real-world threats, revealing significant performance degradation due to untrustworthy third-party content.
Contribution
It introduces a scalable content instrumentation framework and a comprehensive benchmark to test GUI agents under realistic, challenging app scenarios.
Findings
Agent performance degrades significantly when apps contain untrustworthy third-party content.
The misleading rate averages 42.0% in dynamic environments and 36.1% in static environments.
The framework and benchmark are publicly released for further research.
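To make the averaged misleading rates above concrete, here is a minimal sketch of how such a figure could be computed. The summary does not define the metric, so this assumes that a misleading rate is the percentage of episodes in which an agent acts on injected third-party content, and that the reported average is a mean over per-agent rates. The `Episode` record, its fields, and both function names are hypothetical and not taken from the paper's released code.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One agent run on a task with injected content (hypothetical record)."""
    agent: str
    misled: bool  # assumption: True if the agent acted on the injected content


def misleading_rate(episodes):
    """Percentage of episodes in which the agent was misled."""
    if not episodes:
        raise ValueError("need at least one episode")
    return 100.0 * sum(e.misled for e in episodes) / len(episodes)


def average_misleading_rate(episodes):
    """Mean of per-agent misleading rates (assumed averaging scheme)."""
    if not episodes:
        raise ValueError("need at least one episode")
    by_agent = {}
    for e in episodes:
        by_agent.setdefault(e.agent, []).append(e)
    rates = [misleading_rate(runs) for runs in by_agent.values()]
    return sum(rates) / len(rates)


# Toy data for illustration only; these are not results from the paper.
toy_episodes = [
    Episode("agent_a", True),
    Episode("agent_a", False),
    Episode("agent_b", True),
    Episode("agent_b", False),
    Episode("agent_b", False),
    Episode("agent_b", False),
]
toy_average = average_misleading_rate(toy_episodes)
```

Averaging per agent rather than over all episodes keeps an agent with many runs from dominating the figure; whether the paper averages this way is not stated in the summary.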
Abstract
Recent years have witnessed the rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and several commercial agents have already been released and are used by early adopters. However, are we really ready for GUI agents to be integrated into our everyday devices as system building blocks? We argue that an important pre-deployment validation step is missing: examining whether the agents can maintain their performance under real-world threats. Specifically, unlike existing common benchmarks, which are based on simple, static app content (necessarily so, to ensure a consistent environment across tests), real-world apps are filled with contents from…
